I cancelled Claude: Token issues, declining quality, and poor support
TL;DR Highlight
According to the author, Anthropic’s Claude Code Pro declined in speed, token allowance, and support quality over three weeks, and the post sparked a community discussion among developers.
Who Should Read
Developers currently paying for and using AI coding tools like Claude Code, Copilot, and Codex in production environments, particularly those considering alternatives due to recent changes in Claude’s performance or token limits.
Core Mechanics
- The author initially found Claude Code Pro satisfactory in terms of speed, token allowance, and quality, but experienced a rapid deterioration over the following three weeks.
- A sudden spike to 100% token usage occurred after just two simple queries to Claude Haiku following a 10-hour break, with no clear explanation for the consumption.
- Customer support provided only generic responses from an AI bot, followed by a copy-pasted reply from a human agent, and ultimately closed the ticket with a disclaimer that it might not be monitored.
- The author’s capacity for parallel work dropped sharply: from working on three projects at once to only two hours of work on a single project before the token limit ran out.
- When asked to refactor a project, Claude Opus proposed a workaround: adding a generic initializer to ui-events.js that injects value displays into all range inputs. The author considers this a low-quality solution that even a junior developer would avoid.
- Opus consumed approximately 50% of the token allowance in five hours while implementing this workaround, wasting tokens before producing a usable result.
- Conversation cache issues were also present, requiring the model to reload the codebase from scratch after periods of inactivity, effectively doubling the cost of initial loading.
- The author is also comparing Claude Code to GitHub Copilot, OpenAI Codex, and locally-run Qwen3.5-9B models using OMLX and Continue.
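The cache and consumption points above reduce to simple arithmetic, sketched below. This is an illustration only: the function names and the token counts in the usage lines are hypothetical, and the sketch assumes that each cache reset costs one full reload of the codebase and that token usage grows linearly over time. Only the "50% in five hours" figure comes from the post.

```python
def session_tokens(codebase_tokens: int, work_tokens: int, cache_resets: int = 0) -> int:
    """Tokens used when the codebase is loaded once and then reloaded
    from scratch after every cache reset (assumed: reload cost == load cost)."""
    if codebase_tokens < 0 or work_tokens < 0 or cache_resets < 0:
        raise ValueError("inputs must be non-negative")
    return codebase_tokens * (1 + cache_resets) + work_tokens


def hours_left(used_fraction: float, elapsed_hours: float) -> float:
    """Linear projection of the hours remaining before the allowance reaches 100%."""
    if not 0 < used_fraction <= 1 or elapsed_hours <= 0:
        raise ValueError("used_fraction must be in (0, 1] and elapsed_hours > 0")
    return elapsed_hours * (1 - used_fraction) / used_fraction


if __name__ == "__main__":
    # Hypothetical sizes: a 40,000-token codebase and 10,000 tokens of actual work.
    print(session_tokens(40_000, 10_000, cache_resets=0))  # 50000
    print(session_tokens(40_000, 10_000, cache_resets=1))  # 90000
    # Figure from the post: about 50% of the allowance used in five hours.
    print(hours_left(0.5, 5))  # 5.0
```

With one reset, the loading portion doubles from 40,000 to 80,000 tokens, which is the "doubling" effect described in the cache bullet.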
Evidence
- One user reported receiving code from Claude Sonnet with missing requirements, duplicate code, unnecessary data mapping, and fake tests written to pass rather than to validate functionality. The user stated that coding was easier before AI and that verifying AI-generated code is more time-consuming.
- Conversely, a user employing Claude Opus as a ‘copilot’ (limited-scope prompts and thorough review) experienced no token limit issues and achieved 9/9 one-shot bug fixes in an old Unity C# project.
- Multiple colleagues reported a noticeable decline in Claude’s performance over the past two months, with Claude 4.6 exhibiting forgetfulness and poor decision-making, and 4.7 offering little improvement.
- Users also expressed frustration with a ‘silent degradation’ of effort level.
- Reports suggest Claude’s performance varies significantly by time of day. A graph tracking Claude Code performance is available at marginlab.ai/trackers/claude-code, and commenters speculate that frontier models use a ‘quality dial’ that adjusts quantization levels between peak and off-peak hours.
- A user who switched to OpenAI Codex (GPT 5.4/5.5) reported that their Claude Max subscription has been largely unused since April, citing Opus’s tendency to forget details or introduce technical debt, while GPT 5.4+ considers edge cases and reduces subsequent errors.
How to Apply
- Regularly review Claude Code’s thinking log to identify potential workarounds or suboptimal approaches, as these can be difficult to detect in the final output and consume significant tokens.
- Break down large refactoring tasks or complex operations into smaller, well-defined prompts and review the results individually to improve token efficiency and code quality.
- Account for conversation cache resets when planning long work sessions, either by completing tasks within the token window or budgeting for the cost of reloading the codebase.
- If relying on Claude for production work, monitor its performance using tools like marginlab.ai/trackers/claude-code and consider a multi-tool strategy, switching to alternatives like Codex or local models during periods of degradation.
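The multi-tool strategy above can be expressed as a small fallback rule. The sketch below is hypothetical throughout: `pick_tool`, the tool labels, the threshold, and the pass rates are illustrative values that you would record yourself (for example, from your own review of recent results). It does not assume that any tracker exposes an API.

```python
from typing import Optional


def pick_tool(pass_rates: dict[str, float],
              preference: list[str],
              threshold: float = 0.5) -> Optional[str]:
    """Return the first tool in preference order whose recent pass rate
    meets the threshold, or None if no tool qualifies."""
    for tool in preference:
        rate = pass_rates.get(tool)  # a tool with no recorded rate is skipped
        if rate is not None and rate >= threshold:
            return tool
    return None


if __name__ == "__main__":
    # Hypothetical, hand-recorded share of recent tasks that passed review.
    recent = {"claude-code": 0.4, "codex": 0.7, "local-qwen": 0.55}
    order = ["claude-code", "codex", "local-qwen"]
    print(pick_tool(recent, order))  # codex
```

Keeping the preference order separate from the measured rates means the default tool is used whenever it is performing acceptably, and the switch happens only during a measured dip.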
Code Example
# Claude Code’s maximum output token setting (environment variable mentioned in the comments)
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=8000
# Local inference alternative (stack used by the author)
# OMLX + Continue extension + Qwen3.5-9B model combination
# When directly prompting the model with the llama_cpp web UI
# Fast one-shot processing without the Claude Code agent layer
Related Papers
Did Claude increase bugs in rsync?
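An article that tests, with real data and statistical analysis, the social media claim that bugs increased after Claude AI was introduced to the rsync project. It concludes that there is no statistical evidence that releases after Claude's introduction are unusually buggy relative to the historical distribution.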
I built a vulnerable app and spent $1,500 seeing if LLMs could hack it
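The author built an app with Firebase vulnerabilities and tested whether major LLMs such as GPT-5.5, Claude, and Deepseek could hack it autonomously. GPT-5.5 dominated with a 70% success rate, while Claude scored low regardless of capability because of its security refusal policy.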
Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models
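A technique in which an LLM groups multiple answers by meaning into a multiple-choice question and grades itself, producing a numeric score for how confident it is in an answer.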
SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
A security benchmark showing that planting malicious payloads in the 'Skill packages' used by AI agents compromises even the latest models up to 86% of the time.
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
A debugging framework that automatically finds why LLM memory systems such as RAG and Mem0 produce wrong answers.
DeepSWE: A contamination-free benchmark for long-horizon coding agents
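A coding-agent benchmark built from scratch to address the data contamination and verification errors of the existing SWE-bench. GPT-5.5 ranks first at 70%, and the performance gaps between models show up much more clearly.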