Claude Code bug can silently 10-20x API costs
TL;DR Highlight
A warning post about two cache-related bugs in Claude Code that can silently spike API costs by up to 10–20x. Users on the $200/month plan are reportedly burning through their limits far faster than expected.
Who Should Read
Developers using Claude Code (Anthropic's AI coding tool) on an API-cost basis, especially those running automated pipelines on the Max plan or via direct API integration.
Core Mechanics
- Claude Code has two cache-related bugs that prevent prompt caching (the feature that reuses previously processed tokens to reduce costs) from working correctly, potentially causing API costs to spike by up to 10–20x.
- The problem occurs silently. Users think they are doing the same work as usual, but the cache is being invalidated and the full context is reprocessed from scratch on every turn, so token costs climb steeply as the session grows.
- One commenter reported using the $200/month Max 5x plan: two days of heavy work analyzing thousands of files across 20 simultaneous sessions consumed only 50% of the limit, but a few hours of simple refactoring and bug fixes wiped out the remaining 50%.
- That same user burned through 100% of their limit in just a few small bug-fix sessions (four ~20-minute sessions, ~45 minutes total work) and had to wait two days for rollover. That volume of work should normally account for only a few percent of the limit.
- A separate comment described a case where Claude Opus 4 hallucinated a non-existent API and kept looping to make tests pass, consuming roughly $12 in 30 minutes. The commenter attributed most of that cost to thinking tokens.
- A similar looping issue was also reported with Gemini, illustrating that runaway cost explosions from infinite loops are a structural risk across AI coding tools in general.
- The community raised questions about whether these charges are for 'unverifiable work,' pointing out that users currently have no practical way to independently audit cache hit status or actual token consumption.
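To make the scale of the claimed 10–20x figure concrete, the following is a rough cost model, not a reproduction of the reported bugs. It compares a session where the growing context is read from cache on each turn against one where the cache is invalidated and the whole prompt is rewritten every turn. The function name, the session sizes, and the multipliers (cache writes at 1.25x and cache reads at 0.10x the base input price) are all assumptions chosen for illustration; check current pricing before relying on them.

```python
def session_input_cost(turns, context_tokens, new_tokens_per_turn,
                       cache_works, write_mult=1.25, read_mult=0.10):
    """Total input cost of a session, in units of the base per-token input price.

    Assumed multipliers (illustrative): a cache write costs write_mult x base,
    a cache read costs read_mult x base.
    """
    total = 0.0
    prefix = context_tokens  # tokens already in the conversation before this turn
    for turn in range(turns):
        if cache_works and turn > 0:
            # Healthy cache: earlier context is read cheaply, only new tokens are written.
            total += prefix * read_mult + new_tokens_per_turn * write_mult
        else:
            # First turn, or cache invalidated: the entire prompt is written again.
            total += (prefix + new_tokens_per_turn) * write_mult
        prefix += new_tokens_per_turn
    return total


# Hypothetical session: 100k tokens of context, 50 turns, 2k new tokens per turn.
healthy = session_input_cost(50, 100_000, 2_000, cache_works=True)
broken = session_input_cost(50, 100_000, 2_000, cache_works=False)
ratio = broken / healthy
```

Under these assumed numbers the broken-cache session costs a bit under ten times the healthy one on input tokens alone, which is the same order of magnitude as the reported range. Longer sessions with larger contexts push the ratio higher, since the uncached cost is dominated by the repeatedly reprocessed prefix.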
Evidence
- A Max 5x plan ($200/month) user shared concrete numbers showing an extreme imbalance: two days of heavy work processing thousands of files across 20 simultaneous sessions consumed 50% of the limit, while a few hours of light refactoring consumed the remaining 50%. The user strongly criticized the situation, saying 'I'm not sure if this is a bug or a silent limit reduction, but this is unacceptable for $200/month.'
- The Opus 4 hallucination-and-loop incident is also noteworthy: the model fabricated a non-existent API and looped trying to make tests pass, burning $12 in 30 minutes, with thinking tokens suspected as the primary cause.
- Cynical comments like 'This is a feature, not a bug' and 'Some PM just hit their 1000% revenue growth KPI' suggest the community views this not just as a bug but as a business incentive problem.
- It was also pointed out that users currently have no way to independently verify whether cache hits occurred or how many tokens were actually consumed, so validating charges essentially requires reverse engineering.
- A similar looping issue reported with Gemini further suggests that this is a structural risk across AI coding tools broadly, particularly for agent-based tasks that run autonomously, where infinite loops translate directly into cost explosions.
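For readers calling the API directly rather than through a subscription plan, a rough audit of cache behavior is possible from per-response usage counters. The sketch below assumes you have collected each response's usage data as a dictionary with the fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; the helper names and the 0.5 threshold are hypothetical, and nothing in the source confirms that Claude Code itself exposes these numbers to subscription users.

```python
def cache_hit_ratio(usage_records):
    """Fraction of all input tokens that were served from cache.

    Each record is assumed to be a dict of per-response usage counters;
    missing or None fields are treated as zero.
    """
    def total(field):
        return sum((u.get(field) or 0) for u in usage_records)

    read = total("cache_read_input_tokens")         # tokens served from cache
    written = total("cache_creation_input_tokens")  # tokens newly written to cache
    fresh = total("input_tokens")                   # uncached input tokens
    all_input = read + written + fresh
    return read / all_input if all_input else 0.0


def looks_uncached(usage_records, threshold=0.5):
    """Flag a multi-turn session whose hit ratio is suspiciously low.

    The threshold is an arbitrary illustrative value, not a documented norm.
    """
    return len(usage_records) > 1 and cache_hit_ratio(usage_records) < threshold
```

A long session that keeps reporting large `cache_creation_input_tokens` and near-zero `cache_read_input_tokens` on every turn would be consistent with the invalidation behavior described in the post, although it does not by itself prove which bug is responsible.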
How to Apply
- If you use Claude Code for automated pipelines or long-running agent tasks, check the usage dashboard in the Anthropic console before and after your work to monitor for abnormal token consumption. A sudden large spike in usage after a short task is a signal to suspect the cache bug.
- For tasks where an agent might loop, such as automated test fixes or iterative code generation and validation, always set a maximum iteration count or a total cost cap. Since Claude Code's loop detection and auto-stop functionality is currently incomplete, manually monitor sessions or split work into short session segments for safety.
- Exercise extra caution when using Opus 4-series models with thinking tokens enabled. Thinking tokens are billed as output tokens, which cost more than input tokens, and if a hallucination triggers a loop, costs can escalate quickly. For cost-sensitive tasks, disable the thinking feature or test first with a cheaper model (Haiku or Sonnet series).
- Until the cache bugs are fixed, splitting work into short independent sessions is preferable to long sessions that reuse the same context. The longer a session runs, the greater the risk of a cost explosion from cache invalidation.
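The iteration and cost caps recommended above could be sketched as a small wrapper around whatever drives the agent. This is a minimal illustration under stated assumptions: `run_agent_loop` and the `step` callback are hypothetical names, and `step` is assumed to perform one agent iteration and return a `(done, cost_usd)` pair that you compute yourself from reported token usage. It is not a Claude Code feature.

```python
def run_agent_loop(step, max_iterations, max_cost_usd):
    """Run step() until it reports completion or a cap is reached.

    step: callable performing one agent iteration and returning
          (done: bool, cost_usd: float). Hypothetical interface.
    Returns (reason, iterations_run, total_spent), where reason is
    "done", "cost_cap", or "iteration_cap".
    """
    spent = 0.0
    for iteration in range(1, max_iterations + 1):
        done, cost_usd = step()
        spent += cost_usd
        if done:
            return ("done", iteration, spent)
        if spent >= max_cost_usd:
            # Stop before starting another iteration once the budget is used up.
            return ("cost_cap", iteration, spent)
    return ("iteration_cap", max_iterations, spent)


# Usage example: a stand-in step that never succeeds, like a model
# looping on tests that cannot pass.
def stuck_step():
    return (False, 0.5)

result = run_agent_loop(stuck_step, max_iterations=10, max_cost_usd=2.0)
```

The cost check runs after each iteration, so a single expensive step can still overshoot the cap by up to one step's cost; setting the cap somewhat below the real budget leaves room for that.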
Related Papers
Show HN: Bash4LLM+ – A lightweight, dependency-free Bash wrapper for LLM APIs
A single-script tool that calls OpenAI-compatible LLM APIs such as Groq using pure Bash, with no Python or Node.js, and runs in any Unix environment, including Termux (Android).
Wayfinder Router: deterministic routing of queries between local and hosted LLM
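A CLI tool that scores prompt complexity offline, without calling a model, and automatically routes simple queries to a local model and hard queries to a paid model. Useful for developers who want to cut LLM costs while maintaining response quality.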
Apple Neural Engine: Architecture, Programming, and Performance
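A 302-page technical document that analyzes the ANE (Apple Neural Engine), the dedicated AI chip built into Apple devices, through reverse engineering, and discloses for the first time the internal structure and direct access paths hidden beneath Core ML.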
DSpark: Speculative decoding accelerates LLM inference [pdf]
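DeepSeek has published DSpark, a technique that improves on speculative decoding and reportedly raises per-user generation speed by 57–78% at the same system capacity. It is likely one of the key technologies that lets DeepSeek offer its Pro model at a much lower price than competitors.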
Show HN: Smart model routing directly in Claude, Codex and Cursor
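An open-source proxy router that automatically selects a suitable AI model for each prompt in under 50 ms and claims to cut API costs by 40–70%. Community reaction is mixed, however, because of the resulting loss of prompt caching.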
Show HN: Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB
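An experiment that overfits a Transformer so that it memorizes a single file in its entirety and then compresses the file with arithmetic coding, shrinking a 100MB CSV to 7MB (~0.5 bits/byte). It is interesting because the model aims for 'complete memorization of a specific file' rather than 'general understanding', the opposite direction from traditional ML training.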