Claude Code daily benchmarks for degradation tracking
TL;DR Highlight
Marginlab runs automated daily SWE-Bench-Pro benchmarks on Claude Code (Opus 4.6) and uses statistical methods to detect meaningful performance regressions.
Who Should Read
ML engineers running production AI systems who need to track model performance over time, and researchers studying LLM evaluation methodology.
Core Mechanics
- Marginlab built an automated daily benchmarking pipeline that runs Claude Code (Opus 4.6) against SWE-Bench-Pro, producing statistically rigorous performance tracking.
- The key contribution is the statistical methodology: because SWE-bench scores vary substantially from run to run, they use proper significance testing rather than comparing raw scores (a minimal sketch of such a test follows this list).
- Findings showed that Claude Code's performance has measurable fluctuations over time — not always in the direction of improvement.
- This kind of continuous benchmarking is valuable because model updates (including undisclosed changes) can silently affect production workloads.
- SWE-Bench-Pro is a harder variant of SWE-bench with less benchmark contamination risk — better for detecting genuine capability changes.
- The pipeline is open-sourced, enabling others to run similar continuous benchmarks against their own models or tools.
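As a concrete illustration of that testing step (a minimal sketch, not Marginlab's code; the pass counts and run sizes below are hypothetical), a two-proportion z-test asks whether a day's pass rate differs from a baseline run by more than sampling noise:

import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Two-proportion z-test: is the gap between two pass rates
    larger than expected sampling noise?"""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    p_pool = (pass_a + pass_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: baseline run solved 231/500 tasks, today's run 214/500.
z, p = two_proportion_z(231, 500, 214, 500)
print(f"z = {z:.2f}, p = {p:.3f}")  # p is about 0.28: a ~3-point drop is plausibly noise

At 500 tasks per run, a three-point swing sits well inside the noise band, which is why single-run score comparisons mislead.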
Evidence
- The statistical methodology is clearly described: they use confidence intervals and multiple test runs rather than single-point measurements, which makes the results more credible (a sketch of such an interval follows this list).
- HN discussion raised the issue that benchmark scores fluctuate due to temperature/sampling as much as actual model changes — Marginlab's approach accounts for this.
- Several ML engineers noted this is exactly the kind of infrastructure they wish existed for their own production model monitoring.
- There was debate about whether SWE-bench performance correlates well with real-world coding-agent performance; it was acknowledged as imperfect but the best available standardized metric.
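To make the confidence-interval point concrete, here is a self-contained sketch (my own illustration, not Marginlab's code) of a Wilson score interval, which behaves better for pass rates than the naive normal interval at benchmark-sized samples:

import math

def wilson_interval(passed, n, z=1.96):
    """95% Wilson score interval for a pass rate of `passed` out of `n` tasks."""
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical run: 220 of 500 tasks passed.
lo, hi = wilson_interval(220, 500)
print(f"pass rate 44.0%, 95% CI ({lo:.1%}, {hi:.1%})")  # roughly (39.7%, 48.4%)

An interval roughly plus or minus 4 points wide at 500 tasks is the quantitative reason single-point score comparisons are untrustworthy.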
How to Apply
- If you run AI coding agents in production, set up a continuous benchmark against a representative task suite — model performance can change between API updates.
- Use statistical significance testing when comparing model performance: a 2-3% score difference may be pure noise. Marginlab's methodology shows how to tell the difference, and the alerting sketch after this list illustrates the idea.
- Consider subscribing to or replicating Marginlab's pipeline for model regression alerts — especially useful before/after a model version update.
- Track your own internal task distribution against public benchmarks to understand how well SWE-bench scores predict your specific workload performance.
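One way the alerting advice might look in practice (a hedged sketch under my own assumptions; the function name, seven-day window, threshold, and daily numbers are all illustrative, not Marginlab's pipeline):

from collections import deque

def make_regression_alert(window=7, z_crit=2.58):
    """Return a checker that ingests daily (passed, total) results and
    flags days whose pass rate deviates from the rolling baseline."""
    history = deque(maxlen=window)

    def check(passed, total):
        if len(history) < window:
            history.append((passed, total))
            return None  # still warming up the baseline
        base_pass = sum(p for p, _ in history)
        base_n = sum(n for _, n in history)
        p0 = base_pass / base_n
        se = (p0 * (1 - p0) / total) ** 0.5  # expected std. error of one day's rate
        z = (passed / total - p0) / se
        history.append((passed, total))
        return z if abs(z) > z_crit else None  # non-None means "investigate"

    return check

check = make_regression_alert()
# Hypothetical daily results: seven stable days, then a drop.
for day, (p, n) in enumerate([(225, 500)] * 7 + [(190, 500)]):
    flag = check(p, n)
    if flag is not None:
        print(f"day {day}: z = {flag:.2f}, possible regression")

Using a 2.58 threshold (about 99%) rather than 1.96 cuts down on false alarms when the check fires every day.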
Code Example
# Update Claude Code CLI to the latest version
claude update
# Check the currently installed version
claude --version
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that while LLM-written TLA+ specifications usually pass syntax checks, their behavioral conformance with the real system is only around 46%, illustrating the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, a new advance in interpretability research into what the model is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best models pass 95% or more of the tests on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and even Claude/GPT will simply write security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode refusal as a single direction in activation space, and ablating that direction disables the effect of safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.