TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis
TL;DR Highlight
An open-source tool that tells AI coding agents which tests a code change will affect, before the change is made, reducing regressions by 70%
Who Should Read
Developers who've adopted AI coding agents (Cursor, OpenHands, etc.) in their CI/CD pipelines but struggle with agents breaking existing tests. ML engineers evaluating agent performance on benchmarks like SWE-bench, or looking to improve code-automation quality.
Core Mechanics
- Telling the AI agent 'which tests will be affected' before bug fixes reduces regressions by 70% (6.08% → 1.82%) — giving information matters more than teaching methods
- Adding TDD procedure prompts ('write tests first') actually worsens regressions (6.08% → 9.94%) — the 'TDD Prompting Paradox': procedural instructions waste context in smaller models
- TDAD parses Python repos into ASTs to build source↔test file dependency graphs and outputs affected test lists as greppable text files — works without MCP servers or databases
- Deployed as an agent skill with Qwen3.5-35B-A3B + OpenCode: issue resolution rate 24% → 32%, patch generation rate 40% → 68%
- Just reducing SKILL.md (agent instruction file) from 107 lines to 20 lines quadrupled resolution rate (12% → 50%) — smaller models benefit most from short, specific context
- Running Claude Code's auto-improvement loop 15 times: resolution rate 12% → 60% while maintaining 0% regression
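The AST-based source↔test graph construction described above can be sketched in a few lines of Python. This is a minimal illustration only, not the actual TDAD implementation (which adds weighted impact analysis and a pickled graph); the function name and the output shape are assumptions.

```python
import ast
from pathlib import Path

def build_test_map(repo_root: str) -> dict[str, set[str]]:
    """Map source module stems to the test files that import them.

    Simplified sketch: walk every test_*.py file, parse it with the
    `ast` module, and record which modules it imports. The real TDAD
    tool layers weighted impact analysis on top of such a graph.
    """
    root = Path(repo_root)
    test_map: dict[str, set[str]] = {}
    for test_file in root.rglob("test_*.py"):
        tree = ast.parse(test_file.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules = [node.module]
            else:
                continue
            for mod in modules:
                stem = mod.split(".")[-1]  # "mypkg.utils" -> "utils"
                test_map.setdefault(stem, set()).add(
                    str(test_file.relative_to(root))
                )
    return test_map
```

Dumping this mapping to a plain text file is what makes it greppable by an agent at runtime, with no MCP server or database in the loop.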
Evidence
- Phase 1 (Qwen3-Coder 30B, 100 instances): P2P test failures 562 → 155 (72% reduction), regression rate 6.08% → 1.82%
- TDD prompt only: P2P failures increased to 799 (42% worse than vanilla 562), catastrophic regressions 3 → 5
- Phase 2 (Qwen3.5-35B-A3B + OpenCode, 25 instances): resolution rate 24% → 32% (+8pp), patch generation rate 40% → 68% (+28pp)
- Auto-improvement loop: generation 28% → 80%, resolution 12% → 60%, 0% regression maintained throughout
How to Apply
- Run `pip install tdad` then `tdad index` in your Python repo to generate test_map.txt — include this file in the agent's context so it can grep affected tests and verify before patching
- If your existing agent prompt has long TDD procedure instructions, remove them and replace with a short SKILL.md (20 lines) containing only 'which tests to check' — especially effective for sub-30B models
- If you're doing SWE-bench-style evaluation, report PASS_TO_PASS (P2P) failure count alongside resolution rate — consider designing a composite metric using net score = resolution rate − alpha × regression rate
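The composite metric in the last bullet is a one-liner; a sketch follows, where alpha is a tunable penalty weight (1.0 is an assumed neutral default). The example pairs the Phase 1 regression rates with the Phase 2 resolution rates purely for illustration — the paper reports them on different setups.

```python
def net_score(resolution: float, regression: float, alpha: float = 1.0) -> float:
    """Composite metric: reward resolution, penalize regressions.

    net = resolution_rate - alpha * regression_rate, all in percent.
    """
    return resolution - alpha * regression

# Illustrative only (rates taken from different experimental phases):
vanilla = net_score(24.0, 6.08)    # ~17.9
with_tdad = net_score(32.0, 1.82)  # ~30.2
```

A higher alpha encodes a stricter stance that breaking previously passing tests is worse than failing to resolve an issue.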
Code Example
# Installation
pip install tdad
# 1. Repo indexing (generate dependency graph)
cd /path/to/your/python/repo
tdad index
# → Creates .tdad/graph.pkl, test_map.txt
# 2. Query affected tests when a specific file changes
tdad impact src/mymodule/utils.py
# → Outputs list of affected tests
# 3. Example SKILL.md for agents (key: keep within 20 lines)
"""
## Test Verification Skill
1. Fix the bug in the relevant source file.
2. Run: grep '<changed_file_stem>' test_map.txt to find related test files.
3. Run the identified tests with pytest.
4. If any tests fail, revise the patch and re-verify.
5. Only submit when all identified tests pass.
"""
# 4. Example usage of test_map.txt (executed by agent at runtime)
# grep 'utils' test_map.txt
# → tests/test_utils.py
# → tests/integration/test_pipeline.py
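An agent (or a CI script) can also consume test_map.txt programmatically instead of shelling out to grep. This sketch assumes a grep-friendly line format pairing a source stem with a test path; the real test_map.txt layout may differ, and the function name is hypothetical.

```python
from pathlib import Path

def affected_tests(changed_file: str, map_path: str = "test_map.txt") -> list[str]:
    """Return test_map.txt lines mentioning the changed file's stem.

    Mirrors the agent's `grep '<stem>' test_map.txt` step in Python.
    Format assumption: each line pairs a source stem with a test path.
    """
    stem = Path(changed_file).stem  # "src/mymodule/utils.py" -> "utils"
    return [
        line
        for line in Path(map_path).read_text(encoding="utf-8").splitlines()
        if stem in line
    ]
```

The matched test files can then be handed straight to pytest for the verify-before-submit step in the SKILL.md above.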
Original Abstract
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.