1M context is now generally available for Opus 4.6 and Sonnet 4.6
TL;DR Highlight
Anthropic rolled out 1M token context windows for Opus 4.6 and Sonnet 4.6 — this changes what's practical for long-context tasks.
Who Should Read
Developers building applications with large documents, long conversation histories, or codebases that need to be processed as a single context, and ML engineers benchmarking long-context performance.
Core Mechanics
- Claude Opus 4.6 and Sonnet 4.6 now support 1 million token context windows — enough for entire medium-sized codebases, very long books, or months of conversation history (a minimal API call sketch follows this list).
- 1M tokens is approximately 750,000 words or roughly 3,000 pages of text.
- This makes certain use cases that previously required RAG or chunking feasible as direct in-context tasks: analyzing a full codebase, processing large legal document sets, or maintaining very long agent memory.
- The key question is whether the model's attention quality degrades in the middle of a 1M token context (the 'lost in the middle' problem) — early reports suggest Anthropic has made improvements here.
- Pricing at this scale becomes a significant consideration: 1M tokens of input is expensive relative to a well-tuned RAG retrieval that only brings in the relevant 10k tokens.
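As a concrete starting point, here is a minimal sketch of a long-context request through the Anthropic Python SDK. The model ID (`claude-opus-4-6`) and the long-context beta flag (`context-1m-2025-08-07`) are assumptions carried over from earlier 1M-context betas; if the 1M window is now generally available for these models, the beta flag may no longer be needed — check the current API documentation for the exact identifiers.

```python
# Minimal sketch: a single request carrying a very large context through the
# Anthropic Messages API. Model ID and beta flag below are assumptions; consult
# the API docs for the identifiers that apply to Opus 4.6 / Sonnet 4.6.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("full_codebase.txt") as f:  # a pre-concatenated repository dump
    codebase = f.read()

response = client.beta.messages.create(
    model="claude-opus-4-6",            # assumed model ID
    betas=["context-1m-2025-08-07"],    # assumed long-context beta flag
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": (
                "Here is the full codebase:\n\n" + codebase +
                "\n\nWhere is the request retry logic implemented, "
                "and which modules call it?"
            ),
        }
    ],
)
print(response.content[0].text)
```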
Evidence
- Anthropic announced the 1M context expansion with API availability, confirmed through the API documentation.
- HN commenters ran their own tests, feeding in full codebases and asking questions that span the entire repository. Results were generally positive for code navigation tasks.
- Some found that retrieval quality degrades for context items in the 'middle' of a very long context, consistent with the known 'lost in the middle' problem in long-context models.
- Cost comparisons showed that for high-recall tasks, 1M context could actually be cheaper than complex RAG pipelines with re-retrieval and reranking.
How to Apply
- For codebase analysis and navigation tasks, try loading the entire relevant codebase into a single 1M context request before investing in RAG-based code search (a minimal loading sketch follows this list).
- For long-document processing (legal, research, financial), test whether direct 1M context gives better answers than chunked RAG — quality may outweigh cost for high-value queries.
- Structure prompts for 1M contexts carefully: put the most important content at the beginning or end, not the middle, since attention tends to be strongest at the start and end of the context.
- Monitor costs carefully: 1M token inputs at current pricing can be expensive at scale, so model the cost/quality tradeoff of a full-context approach against RAG before committing to an architecture (a rough cost sketch follows this list).
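A minimal sketch of the codebase-loading approach from the first point above, assuming a plain directory walk and the rough heuristic of about four characters per token; real counts should come from the API's token-counting endpoint or a tokenizer. It also repeats the question before and after the code dump, following the beginning/end placement advice.

```python
# Sketch: concatenate a repository into one prompt and roughly size it.
# The chars/4 heuristic is an approximation, not an exact token count.
from pathlib import Path

CONTEXT_BUDGET_TOKENS = 1_000_000
INCLUDE_SUFFIXES = {".py", ".ts", ".go", ".md"}  # adjust to the repository

def build_codebase_prompt(repo_root: str, question: str) -> str:
    parts = []
    for path in sorted(Path(repo_root).rglob("*")):
        if path.is_file() and path.suffix in INCLUDE_SUFFIXES:
            parts.append(f"\n===== {path} =====\n{path.read_text(errors='ignore')}")
    codebase = "".join(parts)

    est_tokens = len(codebase) // 4  # crude chars/4 estimate
    if est_tokens > CONTEXT_BUDGET_TOKENS:
        raise ValueError(f"~{est_tokens} tokens exceeds the 1M budget; trim the file set")

    # Put the question at both the beginning and the end of the prompt,
    # where long-context attention is typically strongest.
    return f"Question: {question}\n{codebase}\n\nQuestion (repeated): {question}"
```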
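And a back-of-the-envelope cost comparison for the last point. The per-token price below is a placeholder, not Anthropic's actual rate; plug in current published pricing and your expected query volume.

```python
# Rough cost model: full 1M-token context vs. a RAG pipeline retrieving ~10k tokens.
# All prices are illustrative placeholders; substitute current published rates.
PRICE_PER_INPUT_TOKEN = 5 / 1_000_000   # placeholder: $5 per million input tokens
FULL_CONTEXT_TOKENS = 1_000_000
RAG_CONTEXT_TOKENS = 10_000
QUERIES_PER_DAY = 500

full_context_daily = FULL_CONTEXT_TOKENS * PRICE_PER_INPUT_TOKEN * QUERIES_PER_DAY
rag_daily = RAG_CONTEXT_TOKENS * PRICE_PER_INPUT_TOKEN * QUERIES_PER_DAY

print(f"full context: ${full_context_daily:,.0f}/day, RAG: ${rag_daily:,.0f}/day")
# At these placeholder numbers the 100x input-size gap dominates, so the quality
# gain from full context has to justify roughly two orders of magnitude more spend
# (prompt caching or batching can narrow the gap).
```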
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels from scratch in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to build the core operation of LLM training from the ground up, without frameworks, and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine that runs without fsync and achieves roughly 65% higher write throughput under otherwise identical conditions. The core idea is avoiding fsync's metadata overhead by combining preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to silently download the 4 GB Gemini Nano model file without user consent, and it re-downloads even after being deleted. The finding raises possible GDPR violations and the environmental cost of rolling this out to billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.