Code Review Agent Benchmark
TL;DR Highlight
Evaluates code review agents using executable tests instead of text similarity — Claude Code 32.1%, all 4 tools combined 41.5%, vs human 100%
Who Should Read
Developers evaluating or building AI-based code review tools; engineers designing automated code quality pipelines
Core Mechanics
- Current SOTA code review agents (Claude Code 32.1%, Devin 24.8%, PR-Agent 23.1%, Codex 20.1%) fall far short of human reviewers (100%) — combined union reaches only 41.5%
- Automated tools excel at Robustness and Testing but severely underperform on Maintainability (7.9–27%), Design, and Documentation — they lack repository-specific conventions
- Claude Code achieves the highest Robustness pass rate (75%) but generates 7.3 comments per PR on average — highest volume, increasing developer burden
- Automated tools and human reviewers focus on different aspects of code — they are complementary, not replacements
- Documenting repository-specific context (e.g., AGENTS.md with coding conventions) is the most actionable direction for improving automated tools
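The combined "union" figure in the first bullet counts a task as solved if any one tool's review passes that task's test. A minimal sketch of the metric, with hypothetical solved-task sets (the real benchmark has 234 tests):

```python
# Sketch of the "union" pass-rate metric: a task counts as solved by the
# combination of tools if ANY single tool's review passes its test.
# The tool names match the benchmark; the solved-task sets are made up.

def pass_rate(solved: set, total: int) -> float:
    """Fraction of tasks a single tool solves."""
    return len(solved) / total

def union_pass_rate(per_tool: dict[str, set], total: int) -> float:
    """Fraction of tasks solved by at least one tool."""
    combined = set().union(*per_tool.values())
    return len(combined) / total

# Hypothetical example with 10 tasks
per_tool = {
    "claude_code": {1, 2, 3},
    "devin": {2, 3, 4},
    "pr_agent": {3, 5},
    "codex": {1, 6},
}
print(union_pass_rate(per_tool, 10))  # 0.6 -- higher than any single tool
```

This is why the union (41.5%) exceeds the best single tool (32.1%): the tools solve partially disjoint sets of tasks.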
Evidence
- 671 PRs from SWE-CARE → 4-stage pipeline (review filtering, Docker environment, NL→tests, agent validation) → final 184 PRs, 234 tests, 67 repositories
- Execution-based evaluation: human review comments are converted to fail-then-pass tests, overcoming the limitation of BLEU/ROUGE/embedding-similarity metrics, which can score zero when a review raises the same issue in different wording
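The fail-then-pass criterion can be sketched as follows. This is a minimal illustration, not the paper's actual harness (which runs inside per-repo Docker environments); the function names, directory layout, and test command are all assumptions:

```python
import subprocess

def is_valid_fail_then_pass(passes_before: bool, passes_after: bool) -> bool:
    """A generated test validates a review comment only if it FAILS on the
    code as submitted (the flagged issue is present) and PASSES once the
    reviewed issue has been fixed."""
    return (not passes_before) and passes_after

def run_suite(repo_dir: str, test_cmd: list[str]) -> bool:
    """Run the test command in a checkout; True means the suite passed.
    (The benchmark does this step inside a Docker environment.)"""
    return subprocess.run(test_cmd, cwd=repo_dir).returncode == 0

def gate_review_test(pre_fix_dir: str, post_fix_dir: str,
                     test_cmd: list[str]) -> bool:
    # Hypothetical driver: pre_fix_dir holds the PR as submitted,
    # post_fix_dir holds the code after the review is addressed.
    return is_valid_fail_then_pass(
        run_suite(pre_fix_dir, test_cmd),
        run_suite(post_fix_dir, test_cmd),
    )
```

A test that passes in both states (never detected the issue) or fails in both (broken or unrelated) is rejected, so only tests that actually discriminate on the reviewed issue survive into the benchmark.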
How to Apply
- Benchmark your own code review agent using the c-CRAB dataset at github.com/c-CRAB-Benchmark
- Add AGENTS.md with repository coding conventions and architecture rules to improve automated tool alignment
- Combine automated review (strong on Robustness and Testing) with human review (strong on Maintainability and Design) for a complementary workflow
Original Abstract
Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing and generate huge volumes of code automatically -- the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases -- the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically given a pull-request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agents. Our evaluation framework is used to evaluate the state of the art today -- the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews -- given a human review of a pull request instance we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews -- indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not least, the agent generated tests from our dataset act as a held-out test suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents -- remains to be investigated.