Claude Code users hitting usage limits 'way faster than expected'
TL;DR Highlight
A prompt cache bug in Anthropic's AI coding assistant Claude Code has been confirmed to cause 10–20x token overconsumption, with users burning through $100–$200/month plans within hours.
Who Should Read
Developers subscribed to Claude Code or Claude Pro/Max plans who use them for everyday development tasks or automated workflows. This is especially critical reading for anyone integrating Claude Code into CI/CD pipelines or repetitive automated tasks.
Core Mechanics
- Anthropic has officially acknowledged the issue, stating that 'users of Claude Code are hitting usage limits much faster than expected, and the team is currently investigating this as a top priority.'
- A user reverse-engineered the Claude Code binary and identified the root cause: when the words 'billing' or 'tokens' appear in a conversation, Claude Code internally replaces certain text, which invalidates the prompt cache (a feature that reuses previous processing results on repeated requests to reduce costs). As a result, the cache is rebuilt from scratch on every request, inflating costs by 10–20x.
- Users report that downgrading helps. One shared case described a visible difference after downgrading to version 2.1.34, and several other users corroborated it.
- A quota policy change compounded the bug. On March 28, Anthropic reduced peak-hour quotas and also ended a promotion that had doubled off-peak usage allowances. These two changes, combined with the bug, appear to have made the perceived burn rate dramatically worse.
- The default prompt cache TTL of only 5 minutes is also a hidden cost factor. Stepping away briefly or pausing work for more than 5 minutes causes the cache to expire, leading to a cost spike on resumption. A 1-hour cache upgrade option exists, but its write cost is 2x the base input token rate, creating a trade-off.
- Warnings have emerged about particular risks in automated workflows. Rate limit errors look like ordinary failures, triggering automatic retries, and users have reported a single session inside a loop consuming an entire daily budget within minutes.
- The opacity of Anthropic's plan limit information makes the problem worse. The Pro plan only states 'at least 5x usage vs. free,' and Standard Team only says '1.25x vs. Pro.' With no way to know actual token or request counts in advance, users have no option but to monitor their dashboard in real time.
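The cache TTL trade-off in the list above can be made concrete with a little arithmetic. What follows is a minimal sketch, not Anthropic's billing logic. It takes the 2x write cost of the 1-hour cache from the text above, and it assumes two figures this article does not state: that a 5-minute cache write costs 1.25x the base input rate and that a cache hit costs 0.1x. It also assumes each request restarts the TTL. Check current pricing before relying on any of these numbers. The names CachePolicy and prefix_cost are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: int
    write_multiplier: float  # cost of writing the prefix, relative to base input rate
    read_multiplier: float   # cost of a cache hit, relative to base input rate


# ASSUMPTION: 1.25x write and 0.1x read are not taken from the article.
FIVE_MIN = CachePolicy(ttl_seconds=300, write_multiplier=1.25, read_multiplier=0.10)
# The 2x write cost comes from the article; the 0.1x read cost is an assumption.
ONE_HOUR = CachePolicy(ttl_seconds=3600, write_multiplier=2.00, read_multiplier=0.10)


def prefix_cost(prefix_tokens: int, gaps_seconds: Sequence[float], policy: CachePolicy) -> float:
    """Cost, in base-input-token equivalents, of resending one cached prefix.

    The first request always writes the cache. gaps_seconds[i] is the idle
    time before each later request: a gap shorter than the TTL is a cache
    hit, anything longer forces a full rewrite.
    """
    cost = prefix_tokens * policy.write_multiplier
    for gap in gaps_seconds:
        if gap < policy.ttl_seconds:
            cost += prefix_tokens * policy.read_multiplier
        else:
            cost += prefix_tokens * policy.write_multiplier
    return cost


if __name__ == "__main__":
    gaps = [60, 600, 60, 600]  # two short pauses, two coffee breaks
    for name, policy in (("5-minute", FIVE_MIN), ("1-hour", ONE_HOUR)):
        print(name, prefix_cost(100_000, gaps, policy))
```

With a 100,000-token prefix and idle gaps of 60, 600, 60 and 600 seconds, this model gives 395,000 base-token equivalents for the 5-minute policy against 240,000 for the 1-hour policy. If every gap stays under 5 minutes, the 5-minute policy is cheaper (165,000 against 240,000). Under the same assumed multipliers, a prefix that is rewritten on every request costs 12.5 times as much per request as a cache hit, which is at least the same order of magnitude as the 10–20x overconsumption reported above.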
Evidence
- Complaints emerged that, even though the bug was officially confirmed, there has been no mention of refunds or compensation. The sentiment that 'since it's been verified as a bug, refunds or credits should be warranted, but nothing will happen unless you actively push back' received widespread agreement.
- Suspicions of intentional A/B testing were raised but ultimately resolved as a bug. Some users suspected a deliberate experiment to test user tolerance for reduced limits, but reverse engineering confirmed it as a cache invalidation bug, though distrust of Anthropic's opaque communication style persisted.
- Criticism of blind loyalty to Claude was also notable. One comment suggested that 'users probably can't tell the difference if Sonnet and Opus are swapped — it's like preferring a $100 wine over a $10 one without being able to taste the difference,' and some users expressed willingness to try alternative models such as kimi and qwen3-coder-next.
- A wave of cancellation stories followed. One user reported paying $40/month combined for Pro and API plans before canceling last month, noting that sessions had been getting progressively shorter since December, to the point where just a few prompts now hit the limit. Another user shared the baffling experience of hitting their limit after asking only two questions in a day.
- Practical tips on context management were also shared. Users noted that while research papers suggest context rot (quality degradation over long conversations) is not a problem, actively managing context in practice improves both quality and cost, and that manually controlling context via the Web UI proved more efficient than using Claude Code.
How to Apply
- If you are integrating Claude Code into CI/CD pipelines or automation scripts, handle rate limit errors explicitly as a separate case. Because rate limit errors currently look identical to ordinary failures, infinite retry loops can occur. Explicitly check the error response type, add backoff logic, and add a hard stop condition for when the daily budget is exceeded.
- Downgrading to Claude Code version 2.1.34 can reduce the overconsumption problem in the short term. This can serve as a temporary workaround until a fixed version is released, and real users have reported a noticeable improvement after downgrading.
- Avoid the keywords 'billing,' 'token,' and usage-related terms in your conversations. Since the bug is triggered by internal text replacement when these keywords appear, which invalidates the cache, make sure such words do not appear in system prompts or conversation history, especially in long automated conversations.
- Parallel-testing alternative models now is a good way to spread risk. Trying models mentioned in the community such as kimi and qwen3-coder-next (which can be run locally) on small-scale tasks and directly comparing quality and cost will help you build a development environment that is less dependent on Anthropic's policy changes.
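The first recommendation above (treat rate limits as their own case, back off, and stop hard at a daily budget) can be sketched as follows. This is an illustrative wrapper under stated assumptions, not part of Claude Code or any Anthropic SDK. RateLimitError, BudgetExceeded and run_with_budget are hypothetical names, the task callable is assumed to return the number of tokens it consumed, and how you recognize a rate limit in your tool's output is left to your own code, which is expected to raise RateLimitError.

```python
import random
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical: raise this from your own wrapper when a rate or usage limit is detected."""


class BudgetExceeded(Exception):
    """Raised when the self-imposed daily budget has been spent."""


def run_with_budget(
    task: Callable[[], int],
    daily_budget: int,
    spent: int = 0,
    max_retries: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run task and return the updated amount spent.

    Only RateLimitError is retried, with exponential backoff plus jitter.
    Any other exception propagates immediately, so ordinary failures are
    never mistaken for a reason to retry.
    """
    for attempt in range(max_retries + 1):
        # Hard stop: refuse to start another attempt once the budget is gone.
        if spent >= daily_budget:
            raise BudgetExceeded(f"spent {spent} of {daily_budget}")
        try:
            return spent + task()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up instead of looping forever
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

In a loop over many tasks, pass the returned value back in as spent so the budget check covers the whole run. Note that this sketch only counts tokens from successful attempts. If failed attempts also consume quota in your setup, account for them separately.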
Related Papers
Show HN: Bash4LLM+ – A lightweight, dependency-free Bash wrapper for LLM APIs
A single-script tool that calls OpenAI-compatible LLM APIs such as Groq using pure Bash, with no Python or Node.js, and runs on any Unix environment, including Termux (Android).
Wayfinder Router: deterministic routing of queries between local and hosted LLM
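A CLI tool that scores prompt complexity offline, without calling a model, and automatically routes simple queries to a local model and hard queries to a paid model. Useful for developers who want to cut LLM costs while maintaining response quality.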
Apple Neural Engine: Architecture, Programming, and Performance
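A 302-page technical document that uses reverse engineering to analyze the ANE (Apple Neural Engine), the dedicated AI chip built into Apple devices, and reveals for the first time the internal structure hidden beneath Core ML and the paths for accessing it directly.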
DSpark: Speculative decoding accelerates LLM inference [pdf]
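DeepSeek has released DSpark, a technique that improves on speculative decoding, and reports per-user generation speeds 57–78% faster at the same system capacity. This is likely one of the key technologies that lets DeepSeek offer its Pro model at a much lower price than competitors.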
Show HN: Smart model routing directly in Claude, Codex and Cursor
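An open-source proxy router that automatically selects a suitable AI model for each prompt in under 50 ms and claims to cut API costs by 40–70%. Community reaction is mixed, however, because of the loss of prompt caching.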
Show HN: Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB
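An experiment that overfits a Transformer so it memorizes a single file in its entirety and then compresses the file with arithmetic coding, shrinking a 100MB CSV to 7MB (~0.5 bits/byte). It is interesting because the model aims at 'complete memorization of a specific file' rather than 'general understanding,' the opposite direction from traditional ML training.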
Related Resources
- Anthropic admits Claude Code quotas running out too fast • The Register
- Tweet on cache invalidation bug discovery (reverse engineering findings)
- Reddit ClaudeCode community - Anthropic official investigation announcement thread
- Reddit ClaudeAI community - Usage limit issue discussion
- Substack post on the opacity of AI credit systems