Claude Sonnet 4 now supports 1M tokens of context
TL;DR Highlight
Claude Sonnet 4's context window expanded 5x from 200K to 1M tokens, with a new tiered pricing model that doubles input costs above 200K.
Who Should Read
Developers using the Claude API for large-codebase analysis or multi-step agents, and Claude Code users who struggle with context management.
Core Mechanics
- Claude Sonnet 4 context window expanded from 200K to 1M tokens — roughly 75,000+ lines of code in a single context.
- Tiered pricing introduced: $3/MTok input and $15/MTok output up to 200K tokens; above 200K, input rises to $6/MTok and output to $22.50/MTok, a notably non-linear pricing structure among frontier LLM APIs (a cost sketch follows this list).
- There are concerns that longer context makes LLMs 'distracted' and degrades output quality; without evals demonstrating accuracy at long context lengths, the cost-to-value trade-off is hard to judge.
- Prompt caching is essential for managing costs with the larger context window.
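To make the tier boundary concrete, below is a back-of-the-envelope cost sketch in Python. It assumes the entire request is billed at the long-context rates once input exceeds 200K tokens (check Anthropic's pricing docs for the exact billing rule); the rates and threshold mirror the figures above, and everything else is illustrative.

```python
# Back-of-the-envelope cost model for Claude Sonnet 4's tiered pricing.
# Assumption: once a request's input exceeds 200K tokens, the whole request
# is billed at the long-context rates.

LONG_CONTEXT_THRESHOLD = 200_000

# USD per million tokens (MTok)
STANDARD = {"input": 3.00, "output": 15.00}
LONG_CONTEXT = {"input": 6.00, "output": 22.50}


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request under the tiered model."""
    rates = LONG_CONTEXT if input_tokens > LONG_CONTEXT_THRESHOLD else STANDARD
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


if __name__ == "__main__":
    print(f"190K in / 4K out: ${request_cost(190_000, 4_000):.2f}")  # 0.63, standard tier
    print(f"800K in / 4K out: ${request_cost(800_000, 4_000):.2f}")  # 4.89, long-context tier
```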
Evidence
- Community skepticism about long-context claims — previous 200K context already had quality degradation reports, and 1M claims need evaluation data to be credible.
- The 2x price jump above 200K creates a natural incentive to optimize context usage and stay under the threshold when possible.
- Double-pressing Escape in Claude Code to rewind to an earlier checkpoint helps manage context efficiently during long sessions.
How to Apply
- For agents handling large codebases, feed full source into context, but always pair it with prompt caching to reduce the cost of repeated calls (see the sketch after this list). If you can stay under 200K tokens, prices are half, so evaluate context-pruning strategies first.
- In Claude Code, rewind to a checkpoint (double-press Escape) when context fills up to keep long sessions efficient.
- Consider the 200K threshold as a budget boundary — for many tasks, aggressive summarization and selective file inclusion can keep you in the cheaper tier.
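Tying the caching and budget-boundary advice together, here is a minimal sketch using the Anthropic Python SDK. It puts the codebase in a cacheable system block, counts tokens first, and only opts into the long-context beta when the prompt actually exceeds 200K tokens. The model ID (claude-sonnet-4-20250514) and the 1M-context beta flag (context-1m-2025-08-07) are assumptions based on Anthropic's announcement at the time; verify both against the current docs.

```python
# Sketch (assumed model ID and beta flag): cacheable codebase prompt that
# stays in the cheaper tier whenever the token count allows.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-sonnet-4-20250514"            # assumed Sonnet 4 model ID
LONG_CONTEXT_BETA = "context-1m-2025-08-07"   # assumed 1M-context beta flag
THRESHOLD = 200_000


def ask_about_codebase(codebase_text: str, question: str) -> str:
    # The large, stable codebase goes in a cacheable system block so repeated
    # questions hit the prompt cache instead of re-billing the full input.
    system = [{
        "type": "text",
        "text": f"You are a code analysis assistant.\n\n<codebase>\n{codebase_text}\n</codebase>",
        "cache_control": {"type": "ephemeral"},
    }]
    messages = [{"role": "user", "content": question}]
    kwargs = dict(model=MODEL, max_tokens=2048, system=system, messages=messages)

    # Count tokens first; under 200K we stay in the cheaper tier and do not
    # need the long-context beta at all.
    count = client.messages.count_tokens(model=MODEL, system=system, messages=messages)
    if count.input_tokens > THRESHOLD:
        response = client.beta.messages.create(betas=[LONG_CONTEXT_BETA], **kwargs)
    else:
        response = client.messages.create(**kwargs)
    return response.content[0].text
```

With caching in place, repeated questions over the same codebase re-bill only the new user turn plus cache reads, which are priced at a fraction of the base input rate; that is where most of the savings come from.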
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels by hand in Swift on Apple Silicon, optimizing step by step across CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to build the core computation of LLM training from scratch without frameworks and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares how it built an SSD-only KV storage engine without fsync, achieving roughly 65% higher write performance under identical conditions. The core of the design is a journal that combines preallocation, O_DIRECT, and alignment to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to silently download the 4 GB Gemini Nano model file without user consent, and to re-download it even after deletion. Concerns have been raised about potential GDPR violations and the environmental cost of rolling this out to billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.