Claude Token Counter, now with model comparisons
TL;DR Highlight
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Who Should Read
Developers operating services with the Claude API, particularly backend/AI developers considering or already using Opus 4.7 and needing precise cost impact analysis.
Core Mechanics
- Simon Willison’s Claude Token Counter now compares token counts across models, simultaneously supporting Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5.
- Claude Opus 4.7 marks Anthropic’s first model to undergo a tokenizer change, potentially converting the same input into 1.0 to 1.35 times more tokens.
- Testing with a system prompt revealed Opus 4.7 generated 1.46 times more tokens than Opus 4.6, exceeding Anthropic’s stated range of 1.35x.
- Despite maintaining the same pricing ($5 per million input tokens, $25 per million output tokens as Opus 4.6), the increased token count results in a real cost increase of over 40%.
- Testing with a high-resolution image (3456x2234 pixels, 3.7MB PNG) showed Opus 4.7 generating 3.01 times more tokens than Opus 4.6, due to enhanced Vision capabilities supporting images up to 2,576 pixels.
- Conversely, smaller images (682x318) showed negligible token differences between Opus 4.7 (314 tokens) and 4.6 (310 tokens), indicating the increase stems from high-resolution support, not the tokenizer itself.
- A 15MB, 30-page text-centric PDF resulted in Opus 4.7 generating 60,934 tokens versus 56,482 for 4.6, a 1.08x difference—a smaller increase than observed with images.
- The token counting API requires a Claude API key and allows pre-checking expected token counts for each model by specifying the model ID.
Evidence
- "Critics labeled the tokenizer change a ‘money grab,’ citing Anthropic’s lack of transparency regarding the reasons or methodology behind the alteration. Technical counterarguments suggest the change could be an intentional design for performance improvements, potentially improving inference quality by breaking down text into more meaningful units. Speculation also arose about replacing the tokenizer with a smaller learning model, similar to Byte Latent Transformer. Data from tokens.billchambers.me/leaderboard shows large-scale comparisons between 4.6 and 4.7, with one user reporting a 40% increase in tokens for their prompts. Practical experience reveals that token costs escalate in agent systems due to re-transmitting the entire context (including previous tool call results) upon timeouts, potentially consuming three times the tokens for a failed API call. Developers are responding by maintaining the default model in Claude CLI as 4.6 and using the `--model claude-opus-4-7` flag only when necessary, and by downsampling high-resolution images before upload."
How to Apply
- "If considering migrating to Opus 4.7, pre-measure the token cost increase for your existing system prompts and representative inputs using Simon Willison’s Claude Token Counter (https://tools.simonwillison.net/claude-token-counter). If upgrading image processing pipelines to Opus 4.7, pre-resize images to 682x318 if high resolution isn’t essential to maintain token costs comparable to Opus 4.6. When using Claude CLI or API, separate models based on task complexity to manage costs, using Sonnet 4.6 or Haiku 4.5 as defaults and specifying `--model claude-opus-4-7` only for complex tasks. For agent systems, monitor tokens at both the token and action levels; track whether side effects actually executed to reduce unnecessary re-attempts and minimize token waste."
Terminology
Related Papers
Show HN: Lowfat – pluggable CLI filter that saved 91.8% of my LLM tokens
AI 에이전트가 CLI 명령어 출력을 읽을 때 불필요한 노이즈를 제거해 토큰 사용량을 줄여주는 Rust 기반 CLI 필터 도구. Claude Code, OpenCode 등 주요 AI 코딩 에이전트와 통합 가능하다.
1-Bit Bonsai Image 4B Image Generation for Local Devices
4B 파라미터 이미지 생성 모델의 가중치를 1비트/3값으로 극단적으로 압축해서 iPhone에서도 돌아가게 만든 모델. 7.75GB짜리 diffusion transformer를 0.93GB까지 줄였다.
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
vLLM의 핵심 기능을 C++와 CUDA로 직접 구현하며 배울 수 있는 교육용 LLM 추론 엔진 프로젝트로, 소스코드와 단계별 강의가 함께 제공된다.
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
Kog AI가 8× AMD MI300X에서 요청당 3,000 tokens/s를 달성하는 LLM 추론 엔진을 공개했고, 기존 소프트웨어 스택의 병목을 GPU 메모리 대역폭 최대화로 풀어냈다는 내용이다.
A sleep-like consolidation mechanism for LLMs
LLM이 긴 컨텍스트를 처리할 때 발생하는 Attention 비용 문제를 해결하기 위해, 사람의 수면처럼 주기적으로 컨텍스트를 fast weight에 압축·저장하는 새로운 메커니즘을 제안한 논문이다.
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
GPU에서 Transformer 학습 시 발생하는 메모리 병목을 해결하기 위해, 정규화·활성화 등 소규모 연산들을 GEMM 출력이 칩 위에 있는 동안 함께 실행하는 커널 추상화 CODA를 소개한다. LLM이 이 추상화를 활용해 고성능 커널을 자동 생성할 수 있다는 점이 특히 주목받고 있다.