Are LLM merge rates not getting better?
TL;DR Highlight
Re-analyzing METR's SWE-bench data shows LLM code quality (actual PR merge rates) lags far behind benchmark scores — the gap between benchmark and reality is widening.
Who Should Read
Engineers evaluating LLMs for software development, researchers studying AI coding benchmarks, and anyone trying to assess how much to trust SWE-bench-style metrics.
Core Mechanics
- METR's SWE-bench dataset includes metadata about whether human-submitted PRs were actually merged — a real-world quality signal.
- Re-analysis found that LLM-generated solutions that 'pass' SWE-bench tests have significantly lower actual merge rates than human PRs at the same task difficulty.
- This suggests SWE-bench scores overstate practical code quality: models are optimizing for test passage (which can be gamed) rather than code that a human reviewer would approve.
- The gap is larger on harder tasks: for simple bug fixes, LLM and human quality are closer; for complex architectural changes, the gap widens substantially.
- This is a benchmark contamination/Goodhart's law dynamic: once a benchmark becomes a target, models optimize for it in ways that diverge from the underlying quality it was meant to measure.
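The bucketed comparison described above can be sketched in a few lines. Everything here is illustrative: the record schema, field names, and numbers are invented for the example and are not METR's actual data format.

```python
from collections import defaultdict

# Hypothetical per-task records: a difficulty bucket, whether the LLM
# patch passed the SWE-bench tests, and whether the corresponding
# human PR was actually merged. Invented data for illustration only.
records = [
    {"difficulty": "simple",  "llm_pass": True, "human_merged": True},
    {"difficulty": "simple",  "llm_pass": True, "human_merged": True},
    {"difficulty": "complex", "llm_pass": True, "human_merged": True},
    {"difficulty": "complex", "llm_pass": True, "human_merged": False},
    {"difficulty": "complex", "llm_pass": True, "human_merged": False},
    {"difficulty": "complex", "llm_pass": True, "human_merged": False},
]

def rates_by_difficulty(records):
    """Return (llm_test_pass_rate, human_merge_rate) per difficulty bucket."""
    buckets = defaultdict(lambda: {"n": 0, "llm": 0, "human": 0})
    for r in records:
        b = buckets[r["difficulty"]]
        b["n"] += 1
        b["llm"] += r["llm_pass"]       # bool sums as 0/1
        b["human"] += r["human_merged"]
    return {
        d: (b["llm"] / b["n"], b["human"] / b["n"])
        for d, b in buckets.items()
    }

print(rates_by_difficulty(records))
# → {'simple': (1.0, 1.0), 'complex': (1.0, 0.25)}
```

With this toy data the two rates agree on simple tasks but diverge sharply on complex ones, which is the shape of the gap the re-analysis reports.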
Evidence
- The analysis compared SWE-bench test pass rates against the historical PR merge rates for the same tasks, finding a substantial divergence.
- HN commenters with experience evaluating LLMs for production code confirmed this matches their empirical experience — models that ace benchmarks often produce code that fails code review.
- Some noted that SWE-bench uses a specific set of open-source repos that models have likely seen in training, adding data contamination as another validity concern.
- Researchers noted this highlights the need for 'lived benchmarks' — evaluations based on real-world usage rather than held-out test sets.
How to Apply
- Don't rely solely on SWE-bench scores when evaluating LLMs for production coding use cases — run your own evals on code samples representative of your actual codebase.
- When building LLM-assisted code review pipelines, instrument your actual merge/rejection rates for AI-suggested code — this is the ground truth metric SWE-bench approximates.
- For researchers: designing evaluations that can't easily be gamed (e.g. via test-time optimization against the unit tests) is the key challenge. Consider proxies like code-reviewer acceptance or downstream test coverage rather than unit-test pass rates alone.
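To instrument your own ground-truth metric as suggested above, a minimal tracker is enough. This is a sketch under assumed conditions: the `PROutcome` shape and the sample outcomes are hypothetical, and you would populate them from your own review or VCS tooling.

```python
from dataclasses import dataclass

@dataclass
class PROutcome:
    """Outcome of one pull request. Fields are illustrative; fill them
    in from your own code-review or version-control tooling."""
    ai_generated: bool  # was the patch LLM-suggested?
    merged: bool        # did it survive review and get merged?

def merge_rate(outcomes, ai_generated):
    """Merge rate for AI-generated (or human) PRs; None if no data."""
    subset = [o for o in outcomes if o.ai_generated == ai_generated]
    if not subset:
        return None
    return sum(o.merged for o in subset) / len(subset)

# Hypothetical sample data.
outcomes = [
    PROutcome(ai_generated=True,  merged=True),
    PROutcome(ai_generated=True,  merged=False),
    PROutcome(ai_generated=False, merged=True),
]
print(merge_rate(outcomes, ai_generated=True))   # → 0.5
print(merge_rate(outcomes, ai_generated=False))  # → 1.0
```

Comparing the two rates over time on your own codebase gives you the real-world signal that SWE-bench scores only approximate.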
Terminology
SWE-bench: A benchmark for evaluating LLMs on real-world software engineering tasks, using GitHub issues and PRs from open-source repos as test cases.
Goodhart's Law: When a measure becomes a target, it ceases to be a good measure. Models optimizing for benchmark scores may diverge from the underlying capability the benchmark was designed to measure.
METR: Model Evaluation and Threat Research, an AI safety organization that evaluates AI model capabilities, including coding ability.
Benchmark contamination: When a model's training data includes the benchmark test cases, inflating scores without reflecting genuine capability.