Claude Code users hitting usage limits 'way faster than expected'

TL;DR Highlight

A prompt cache bug in Anthropic's AI coding assistant Claude Code has been confirmed to cause 10–20x token overconsumption, with users burning through $100–$200/month plans within hours.

Who Should Read

Developers subscribed to Claude Code or Claude Pro/Max plans who use them for everyday development tasks or automated workflows. This is especially critical reading for anyone integrating Claude Code into CI/CD pipelines or repetitive automated tasks.

Core Mechanics

Anthropic has officially acknowledged the issue, stating that 'users of Claude Code are hitting usage limits much faster than expected, and the team is currently investigating this as a top priority.'
A user reverse-engineered the Claude Code binary and identified the root cause: when the words 'billing' or 'tokens' appear in a conversation, Claude Code internally replaces certain text, which invalidates the prompt cache (a feature that reuses previous processing results on repeated requests to reduce costs). As a result, the cache is rebuilt from scratch on every request, inflating costs by 10–20x.
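To see why a silent text substitution can be so expensive, here is a toy model of prefix caching. It is a sketch under stated assumptions, not Anthropic's implementation: the cache is modeled as a set of hashes over message prefixes, the `rewrite` hook is a hypothetical stand-in for whatever replacement Claude Code performs, and the price multipliers (1.25x for a cache write, 0.1x for a cache read) are assumed for illustration.

```python
import hashlib

WRITE_MULT = 1.25  # assumed multiplier for writing tokens into the cache
READ_MULT = 0.10   # assumed multiplier for reading tokens from the cache


def prefix_key(messages):
    """Hash a message prefix; any changed character yields a different key."""
    joined = "\x00".join(messages)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def simulate(turns, rewrite=None, tokens_per_msg=1000):
    """Return total input cost, in base-input-token units, for a conversation.

    `rewrite(text, request_index)` is a hypothetical hook that alters the
    first message on every request, standing in for the text replacement.
    """
    cache = set()
    history = []
    total = 0.0
    for i, msg in enumerate(turns):
        history.append(msg)
        sent = list(history)
        if rewrite is not None:
            sent[0] = rewrite(sent[0], i)
        prefix = sent[:-1]  # everything except the newest message
        prefix_tokens = len(prefix) * tokens_per_msg
        if prefix and prefix_key(prefix) in cache:
            # Hit: old prefix is read cheaply, only the new message is written.
            total += prefix_tokens * READ_MULT + tokens_per_msg * WRITE_MULT
        else:
            # Miss: the whole request is written to the cache again.
            total += (prefix_tokens + tokens_per_msg) * WRITE_MULT
        cache.add(prefix_key(sent))
    return total


turns = [f"message {i}" for i in range(20)]
stable = simulate(turns)
broken = simulate(turns, rewrite=lambda text, i: f"{text} [rev {i}]")
print(f"stable={stable:.0f} broken={broken:.0f} ratio={broken / stable:.1f}")
```

In this model the gap widens as the conversation grows, and for long conversations the ratio tends toward `WRITE_MULT / READ_MULT`, which is 12.5 under the assumed multipliers.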
User reports indicate that downgrading helps. One user reported that reverting to version 2.1.34 made a visible difference, and several other users corroborated this.
A quota policy change compounded the bug. On March 28, Anthropic reduced peak-hour quotas and also ended a promotion that had doubled off-peak usage allowances. These two changes, combined with the bug, appear to have made the perceived burn rate dramatically worse.
The default prompt cache TTL of only 5 minutes is also a hidden cost factor. Stepping away briefly or pausing work for more than 5 minutes causes the cache to expire, leading to a cost spike on resumption. A 1-hour cache upgrade option exists, but its write cost is 2x the base input token rate, creating a trade-off.
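The trade-off can be made concrete with a little arithmetic. The sketch below is an illustration, not a billing calculator: the 2x write cost for the 1-hour cache comes from the text above, while the 1.25x write cost for the 5-minute cache, the 0.1x read cost, the assumption that every request refreshes the TTL, and the work pattern in `gaps` are all assumptions made for the example.

```python
def session_cost(gaps_min, prefix_tokens, ttl_min, write_mult, read_mult=0.10):
    """Input cost of the cached prefix, in base-input-token units.

    gaps_min: idle minutes before each request after the first one.
    Assumes the TTL is refreshed by every request.
    """
    cost = prefix_tokens * write_mult  # the first request always writes the cache
    for gap in gaps_min:
        if gap <= ttl_min:
            cost += prefix_tokens * read_mult   # cache still warm: cheap read
        else:
            cost += prefix_tokens * write_mult  # cache expired: full rewrite
    return cost


gaps = [2, 3, 12, 1, 20, 4]  # hypothetical session with two breaks over 5 minutes
short = session_cost(gaps, 50_000, ttl_min=5, write_mult=1.25)
long_ = session_cost(gaps, 50_000, ttl_min=60, write_mult=2.0)
print(f"5-minute TTL: {short:.0f}  1-hour TTL: {long_:.0f}")
```

With this hypothetical pattern, the two breaks longer than five minutes make the 1-hour TTL cheaper. With no break over five minutes, the 5-minute TTL comes out cheaper because its write cost is lower.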
Warnings have emerged about particular risks in automated workflows. Rate limit errors look like ordinary failures, triggering automatic retries, and users have reported a single session inside a loop consuming an entire daily budget within minutes.
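One way to guard an automated loop against this failure mode is sketched below. It is a minimal sketch, not Claude Code's API: `RateLimitError`, `TaskError`, `BudgetExceeded`, and `run_with_guard` are hypothetical names, and the sketch assumes your own wrapper can tell a rate limit apart from an ordinary failure and can report how many tokens an attempt used.

```python
import time


class RateLimitError(Exception):
    """Hypothetical marker: your wrapper detected a rate limit response."""


class TaskError(Exception):
    """Hypothetical ordinary failure that still consumed tokens."""

    def __init__(self, message, tokens_used=0):
        super().__init__(message)
        self.tokens_used = tokens_used


class BudgetExceeded(Exception):
    """Hard stop: the configured budget is used up."""


def run_with_guard(task, budget_tokens, max_attempts=5,
                   base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run task() with backoff on rate limits and a hard token budget.

    task() must return (result, tokens_used), raise RateLimitError on a
    rate limit, and raise TaskError on any other failure.
    """
    spent = 0
    for attempt in range(max_attempts):
        if spent >= budget_tokens:
            raise BudgetExceeded(f"spent {spent} of {budget_tokens} tokens")
        try:
            result, used = task()
            return result, spent + used
        except RateLimitError:
            # Rate limit: wait exponentially longer instead of retrying at once.
            sleep(min(max_delay, base_delay * 2 ** attempt))
        except TaskError as err:
            # Ordinary failure: retry, but count what the attempt cost.
            spent += err.tokens_used
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. The attempt cap and the budget check are both there on purpose: either one alone still lets a loop run longer or cost more than intended.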
The opacity of Anthropic's plan limit information makes the problem worse. The Pro plan only states 'at least 5x usage vs. free,' and Standard Team only says '1.25x vs. Pro.' With no way to know actual token or request counts in advance, users have no option but to monitor their dashboard in real time.

Evidence

Despite the bug being officially confirmed, there has been no mention of refunds or compensation. The sentiment that 'since it's been verified as a bug, refunds or credits should be warranted, but nothing will happen unless you actively push back' received widespread agreement.
Suspicions of intentional A/B testing were raised but ultimately resolved as a bug. Some users suspected a deliberate experiment to test user tolerance for reduced limits, but reverse engineering confirmed it as a cache invalidation bug. Distrust of Anthropic's opaque communication style persisted nonetheless.
Criticism of blind loyalty to Claude was also notable. One comment suggested that 'users probably can't tell the difference if Sonnet and Opus are swapped; it's like preferring a $100 wine over a $10 one without being able to taste the difference,' and some users expressed willingness to try alternative models such as kimi and qwen3-coder-next.
A wave of cancellation stories followed. One user reported paying $40/month combined for Pro and API plans before canceling last month, noting that sessions had been getting progressively shorter since December, to the point where just a few prompts now hit the limit. Another user described hitting their limit after asking only two questions in a day.
Practical tips on context management were also shared. Users noted that while research papers suggest context rot (quality degradation over long conversations) is not a problem, actively managing context in practice improves both quality and cost, and that manually controlling context via the Web UI proved more efficient than using Claude Code.

How to Apply

If you are integrating Claude Code into CI/CD pipelines or automation scripts, handle rate limit errors explicitly as a separate case. Because rate limit errors currently look like ordinary failures, infinite retry loops can occur. Check the error response type explicitly, add backoff logic, and add a hard stop condition for when the daily budget is exceeded.
Downgrading to Claude Code version 2.1.34 can reduce the overconsumption problem in the short term. Users have reported a noticeable improvement after downgrading, so this can serve as a temporary workaround until a fixed version is released.
Avoid the keywords 'billing,' 'token,' and other usage-related terms in your conversations. Since the bug is triggered by internal text replacement when these keywords appear, which invalidates the cache, make sure such words do not appear in system prompts or conversation history, especially in long automated conversations.
Parallel-testing alternative models now is a good way to spread risk. Trying models mentioned in the community such as kimi and qwen3-coder-next (which can be run locally) on small-scale tasks and directly comparing quality and cost will help you build a development environment that is less dependent on Anthropic's policy changes.

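The keyword advice above can be checked mechanically before a prompt is sent. The following is a minimal sketch: the word list is taken from this article, the exact trigger strings inside Claude Code are unverified, and `find_suspect_words` is a hypothetical helper name.

```python
import re

# Trigger words as reported in this article; the exact strings are unverified.
SUSPECT_WORDS = ("billing", "token", "tokens")
_PATTERN = re.compile(r"\b(" + "|".join(SUSPECT_WORDS) + r")\b", re.IGNORECASE)


def find_suspect_words(texts):
    """Return (index, word) pairs for every suspect word found in texts."""
    hits = []
    for i, text in enumerate(texts):
        for match in _PATTERN.finditer(text):
            hits.append((i, match.group(1).lower()))
    return hits
```

A pipeline could run this over its system prompt and queued messages and either rephrase or flag any entry that produces a hit.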
Terminology

Prompt Cache: A feature that stores the results of processing a given input so they can be reused when the same input is encountered again. Similar to caching DB query results, it avoids recomputing the same system prompt or long context from scratch every time, saving both cost and latency.

Cache Invalidation: The process by which a stored cache entry is deemed invalid and discarded. In this bug, certain keywords trigger an unintended cache invalidation, forcing the entire context to be recomputed from scratch.

Context Rot: A phenomenon where quality degrades or early context gets diluted as a conversation with an AI grows longer. Like the telephone game, responses start drifting from the original intent as the conversation extends.

Rate Limit: The upper bound on the number of API requests or tokens that can be used within a given time window. Exceeding it causes the service to reject requests; in this case, the error format is indistinguishable from ordinary failures, causing automatic retries to trigger.

Token: The basic unit by which AI models process text. Roughly 1–2 Korean characters or about 3/4 of an English word; both input and output are billed based on token count.

Quota: The usage limit assigned per subscription plan. Anthropic does not publish exact figures, expressing limits only as multiples of the free tier, making it difficult for users to calculate how much headroom they have left.