Are LLM merge rates not getting better?
TL;DR Highlight
Re-analyzing METR's SWE-bench data shows LLM code quality (actual PR merge rates) lags far behind benchmark scores — the gap between benchmark and reality is widening.
Who Should Read
Engineers evaluating LLMs for software development, researchers studying AI coding benchmarks, and anyone trying to assess how much to trust SWE-bench-style metrics.
Core Mechanics
- METR's SWE-bench dataset includes metadata about whether human-submitted PRs were actually merged — a real-world quality signal.
- Re-analysis found that LLM-generated solutions that 'pass' SWE-bench tests have significantly lower actual merge rates than human PRs at the same task difficulty (a minimal sketch of this comparison follows this list).
- This suggests SWE-bench scores overstate practical code quality: models are optimizing for test passage (which can be gamed) rather than code that a human reviewer would approve.
- The gap is larger on harder tasks: for simple bug fixes, LLM and human quality are closer; for complex architectural changes, the gap widens substantially.
- This is a Goodhart's law dynamic: once a benchmark becomes a target, models optimize for it in ways that diverge from the underlying quality it was meant to measure.
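To make the comparison concrete, here is a minimal pandas sketch of the kind of re-analysis described above. The file name and the columns (difficulty, llm_passed, human_pr_merged) are hypothetical stand-ins for METR's actual metadata, not its real schema.

```python
# Sketch of the pass-rate vs. merge-rate comparison, assuming hypothetical
# per-task metadata: difficulty bucket, whether the LLM patch passed the
# SWE-bench tests, and whether the original human PR was actually merged.
import pandas as pd

tasks = pd.read_json("swebench_with_merge_metadata.jsonl", lines=True)  # hypothetical file

summary = tasks.groupby("difficulty").agg(
    llm_pass_rate=("llm_passed", "mean"),          # fraction of LLM patches passing tests
    human_merge_rate=("human_pr_merged", "mean"),  # fraction of human PRs actually merged
    n_tasks=("llm_passed", "size"),
)
# Positive gap = benchmark score overstates real-world acceptability.
summary["benchmark_reality_gap"] = summary["llm_pass_rate"] - summary["human_merge_rate"]
print(summary.sort_values("benchmark_reality_gap", ascending=False))
```

If the article's claim holds, the gap column should grow as difficulty increases, since the divergence was reported to be largest on complex tasks.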
Evidence
- The analysis compared SWE-bench test pass rates against the historical PR merge rates for the same tasks, finding a substantial divergence.
- HN commenters with experience evaluating LLMs for production code confirmed that this matches what they see in practice: models that ace benchmarks often produce code that fails code review.
- Some noted that SWE-bench uses a specific set of open-source repos that models have likely seen in training, adding data contamination as another validity concern.
- Researchers noted this highlights the need for 'lived benchmarks' — evaluations based on real-world usage rather than held-out test sets.
How to Apply
- Don't rely solely on SWE-bench scores when evaluating LLMs for production coding use cases — run your own evals on code samples representative of your actual codebase.
- When building LLM-assisted code review pipelines, instrument your actual merge/rejection rates for AI-suggested code — this is the ground truth metric SWE-bench approximates (a sketch of such instrumentation follows this list).
- For researchers: designing evaluations that can't easily be gamed by test-time optimization is the key challenge — consider proxies like code reviewer acceptance or downstream test coverage rather than unit test passage alone.
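As a starting point for that instrumentation, here is a minimal sketch using GitHub's issue/PR search API. It assumes your team tags AI-suggested PRs with a hypothetical ai-assisted label; the repo name and label are placeholders, not a standard convention.

```python
# Sketch: measure the merge rate of AI-suggested PRs, assuming they carry
# a (hypothetical) "ai-assisted" label. Uses GitHub's search API; set
# GITHUB_TOKEN for private repos and higher rate limits.
import os
import requests

REPO = "your-org/your-repo"  # placeholder
API = "https://api.github.com/search/issues"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def pr_count(extra_qualifiers: str) -> int:
    """Return how many labeled PRs match the given search qualifiers."""
    q = f"repo:{REPO} is:pr label:ai-assisted {extra_qualifiers}"
    resp = requests.get(API, params={"q": q}, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["total_count"]

merged = pr_count("is:merged")
rejected = pr_count("is:unmerged is:closed")  # closed without merging
decided = merged + rejected
if decided:
    print(f"AI-suggested PR merge rate: {merged / decided:.1%} "
          f"({merged}/{decided} decided PRs)")
```

Tracking this number over time, and comparing it against the merge rate for human-authored PRs in the same repo, gives you the in-house analogue of the article's benchmark-vs-reality gap.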
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that while LLM-written TLA+ specifications mostly pass syntax checks, their behavioral conformance to the actual systems tops out around 46%, highlighting the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language — a new advance in interpretability research into what the model is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model achieved 95%+ test passage on only 3% of the tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will simply write security-vulnerable code 53–86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance differences that schema compliance alone does not capture.