TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis
TL;DR Highlight
An open-source tool that tells AI coding agents which tests a code change will affect, before the change is made, reducing regressions by 70%
Who Should Read
Developers who've adopted AI coding agents (Cursor, OpenHands, etc.) in their CI/CD pipelines but struggle with agents breaking existing tests. ML engineers evaluating agent performance on benchmarks like SWE-bench, or looking to improve code-automation quality.
Core Mechanics
- Telling the AI agent 'which tests will be affected' before bug fixes reduces regressions by 70% (6.08% → 1.82%) — giving information matters more than teaching methods
- Adding TDD procedure prompts ('write tests first') actually worsens regressions (6.08% → 9.94%) — the 'TDD Prompting Paradox': procedural instructions waste context in smaller models
- TDAD parses Python repos into ASTs to build source↔test file dependency graphs and outputs affected test lists as greppable text files — works without MCP servers or databases
- Deployed as an agent skill with Qwen3.5-35B-A3B + OpenCode: issue resolution rate 24% → 32%, patch generation rate 40% → 68%
- Just reducing SKILL.md (agent instruction file) from 107 lines to 20 lines quadrupled resolution rate (12% → 50%) — smaller models benefit most from short, specific context
- Running Claude Code's auto-improvement loop 15 times: resolution rate 12% → 60% while maintaining 0% regression
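The AST-based source↔test graph construction described above can be sketched in a few lines of Python. This is a minimal illustration only, not the actual TDAD implementation (which adds weighted impact analysis and a pickled graph); the function name and the output shape are assumptions.

```python
import ast
from pathlib import Path

def build_test_map(repo_root: str) -> dict[str, set[str]]:
    """Map source module stems to the test files that import them.

    Simplified sketch: walk every test_*.py file, parse it with the
    `ast` module, and record which modules it imports. The real TDAD
    tool layers weighted impact analysis on top of such a graph.
    """
    root = Path(repo_root)
    test_map: dict[str, set[str]] = {}
    for test_file in root.rglob("test_*.py"):
        tree = ast.parse(test_file.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules = [node.module]
            else:
                continue
            for mod in modules:
                stem = mod.split(".")[-1]  # "mypkg.utils" -> "utils"
                test_map.setdefault(stem, set()).add(
                    str(test_file.relative_to(root))
                )
    return test_map
```

Dumping this mapping to a plain text file is what makes it greppable by an agent at runtime, with no MCP server or database in the loop.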
Evidence
- Phase 1 (Qwen3-Coder 30B, 100 instances): P2P test failures 562 → 155 (72% reduction), regression rate 6.08% → 1.82%
- TDD prompt only: P2P failures increased to 799 (42% worse than vanilla 562), catastrophic regressions 3 → 5
- Phase 2 (Qwen3.5-35B-A3B + OpenCode, 25 instances): resolution rate 24% → 32% (+8pp), patch generation rate 40% → 68% (+28pp)
- Auto-improvement loop: generation 28% → 80%, resolution 12% → 60%, 0% regression maintained throughout
How to Apply
- Run `pip install tdad` then `tdad index` in your Python repo to generate test_map.txt — include this file in the agent's context so it can grep affected tests and verify before patching
- If your existing agent prompt has long TDD procedure instructions, remove them and replace with a short SKILL.md (20 lines) containing only 'which tests to check' — especially effective for sub-30B models
- If you're doing SWE-bench-style evaluation, report PASS_TO_PASS (P2P) failure count alongside resolution rate — consider designing a composite metric using net score = resolution rate − alpha × regression rate
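The composite metric in the last bullet is a one-liner; a sketch follows, where alpha is a tunable penalty weight (1.0 is an assumed neutral default). The example pairs the Phase 1 regression rates with the Phase 2 resolution rates purely for illustration — the paper reports them on different setups.

```python
def net_score(resolution: float, regression: float, alpha: float = 1.0) -> float:
    """Composite metric: reward resolution, penalize regressions.

    net = resolution_rate - alpha * regression_rate, all in percent.
    """
    return resolution - alpha * regression

# Illustrative only (rates taken from different experimental phases):
vanilla = net_score(24.0, 6.08)    # ~17.9
with_tdad = net_score(32.0, 1.82)  # ~30.2
```

A higher alpha encodes a stricter stance that breaking previously passing tests is worse than failing to resolve an issue.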
Code Example
# Installation
pip install tdad
# 1. Repo indexing (generate dependency graph)
cd /path/to/your/python/repo
tdad index
# → Creates .tdad/graph.pkl, test_map.txt
# 2. Query affected tests when a specific file changes
tdad impact src/mymodule/utils.py
# → Outputs list of affected tests
# 3. Example SKILL.md for agents (key: keep within 20 lines)
"""
## Test Verification Skill
1. Fix the bug in the relevant source file.
2. Run: grep '<changed_file_stem>' test_map.txt to find related test files.
3. Run the identified tests with pytest.
4. If any tests fail, revise the patch and re-verify.
5. Only submit when all identified tests pass.
"""
# 4. Example usage of test_map.txt (executed by agent at runtime)
# grep 'utils' test_map.txt
# → tests/test_utils.py
# → tests/integration/test_pipeline.py
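An agent (or a CI script) can also consume test_map.txt programmatically instead of shelling out to grep. This sketch assumes a grep-friendly line format pairing a source stem with a test path; the real test_map.txt layout may differ, and the function name is hypothetical.

```python
from pathlib import Path

def affected_tests(changed_file: str, map_path: str = "test_map.txt") -> list[str]:
    """Return test_map.txt lines mentioning the changed file's stem.

    Mirrors the agent's `grep '<stem>' test_map.txt` step in Python.
    Format assumption: each line pairs a source stem with a test path.
    """
    stem = Path(changed_file).stem  # "src/mymodule/utils.py" -> "utils"
    return [
        line
        for line in Path(map_path).read_text(encoding="utf-8").splitlines()
        if stem in line
    ]
```

The matched test files can then be handed straight to pytest for the verify-before-submit step in the SKILL.md above.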
Original Abstract
AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.