TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Mar 18, 2026 • Pepe Alonso

TL;DR Highlight

An open-source tool that tells AI coding agents which tests a change will affect before they edit the code, cutting regressions by 70%.

Who Should Read

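Developers who have brought AI coding agents (Cursor, OpenHands, etc.) into a CI/CD pipeline and are struggling with existing tests breaking. ML engineers who evaluate agent performance on benchmarks such as SWE-bench or who want to raise the quality of automated code changes.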

Core Mechanics

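Telling the AI agent which tests will be affected before it fixes a bug cuts regressions by 70% (6.08% → 1.82%). The key is supplying information, not teaching a method.
Adding a procedural TDD prompt ('write the tests first') actually makes regressions worse (6.08% → 9.94%). This is the 'TDD Prompting Paradox': procedural instructions waste a small model's context.
TDAD parses a Python repo into ASTs, builds a dependency graph between source files and test files, and writes the list of affected tests to a grep-able text file. It runs without an MCP server or a database.
Deployed as an agent skill with Qwen3.5-35B-A3B + OpenCode, it raised the issue resolution rate from 24% to 32% and the patch generation rate from 40% to 68%.
Just shrinking SKILL.md (the agent instruction file) from 107 lines to 20 raised the resolution rate roughly fourfold (12% → 50%). The smaller the model, the more it benefits from short, specific context.
Fifteen runs of an auto-improvement loop, in which Claude Code improves itself, raised the resolution rate from 12% to 60% while regressions stayed at 0%.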
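
The AST-based mapping described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not TDAD's actual implementation: the function names, the in-memory "repo", and the `module: test1 test2` output format are all assumptions made for this example.

```python
import ast
from collections import defaultdict


def imported_modules(source: str) -> set[str]:
    """Return the dotted module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports without a module name are skipped in this sketch.
            modules.add(node.module)
    return modules


def build_test_map(test_sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each imported module to the test files that import it."""
    test_map = defaultdict(list)
    for test_path, source in sorted(test_sources.items()):
        for module in sorted(imported_modules(source)):
            test_map[module].append(test_path)
    return dict(test_map)


def render_test_map(test_map: dict[str, list[str]]) -> str:
    """Render one 'module: test1 test2' line per module so grep can query it."""
    return "\n".join(
        f"{module}: {' '.join(tests)}" for module, tests in sorted(test_map.items())
    )


# Hypothetical in-memory repo: test file path -> file contents.
tests = {
    "tests/test_utils.py": "from mymodule.utils import slugify\nimport os\n",
    "tests/test_pipeline.py": "import mymodule.utils\nimport mymodule.pipeline\n",
}
test_map = build_test_map(tests)
print(render_test_map(test_map))
```

A real indexer would also have to resolve module names to file paths and follow transitive imports. The sketch only records direct imports, which is enough to show why a flat text file can answer "which tests touch this module?" without a server or database.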

Evidence

Phase 1 (Qwen3-Coder 30B, 100 instances): P2P test failures fell from 562 to 155 (a 72% reduction); regression rate 6.08% → 1.82%
With only the TDD prompt added, P2P failures rose to 799 (42% worse than the vanilla 562), and catastrophic regressions went from 3 to 5
Phase 2 (Qwen3.5-35B-A3B + OpenCode, 25 instances): resolution rate 24% → 32% (+8pp), patch generation rate 40% → 68% (+28pp)
Auto-improvement loop: generation 28% → 80%, resolution 12% → 60%, with regression held at 0% throughout
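
The relative changes quoted above can be re-derived from the raw figures. The short check below uses only numbers stated in this section; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Phase 1: P2P failures, vanilla vs. TDAD.
p2p_drop = relative_change(562, 155)
# Phase 1: regression rate, vanilla vs. TDAD.
rate_drop = relative_change(6.08, 1.82)
# Phase 1: P2P failures, vanilla vs. TDD prompt only.
tdd_rise = relative_change(562, 799)

print(f"P2P failures with TDAD:       {p2p_drop:+.1f}%")
print(f"Regression rate with TDAD:    {rate_drop:+.1f}%")
print(f"P2P failures with TDD prompt: {tdd_rise:+.1f}%")
```

The "+8pp" and "+28pp" figures for Phase 2 are percentage-point differences (32 − 24 and 68 − 40), not relative changes, so they are not computed with this helper.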

How to Apply

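Run pip install tdad, then tdad index inside a Python repo to generate test_map.txt. Include this file in the agent's context so the agent can grep it for affected tests and verify them before submitting a patch.
If your existing agent prompt carries long procedural TDD instructions, remove them and substitute a short SKILL.md (about 20 lines) that says only which tests to check. This is especially effective with small models of 30B parameters or fewer.
If you run SWE-bench-style evaluations, report the number of PASS_TO_PASS (P2P) failures alongside the resolution rate. A composite metric can be defined as net score = resolution rate − α × regression rate, where α sets how heavily regressions are penalized.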
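
The composite metric in the last item can be written as a one-line function. The input rates and the choice of α = 2 below are hypothetical values chosen only to show how the penalty changes a ranking; they are not results from the paper.

```python
def net_score(resolution_rate: float, regression_rate: float, alpha: float = 1.0) -> float:
    """net score = resolution rate - alpha * regression rate (all in percent)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return resolution_rate - alpha * regression_rate


# Hypothetical agents (illustrative numbers, not from the paper):
# agent A resolves more issues but breaks more existing tests than agent B.
agent_a = net_score(resolution_rate=30.0, regression_rate=6.0, alpha=2.0)
agent_b = net_score(resolution_rate=28.0, regression_rate=2.0, alpha=2.0)

print(f"agent A: {agent_a:.1f}, agent B: {agent_b:.1f}")
```

With α = 0 the metric reduces to plain resolution rate and agent A ranks first. With α = 2 the ordering flips, so the regression penalty is doing real work in the ranking.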

Code Example

# Install
pip install tdad

# 1. Index the repo (builds the dependency graph)
cd /path/to/your/python/repo
tdad index
# → generates .tdad/graph.pkl and test_map.txt

# 2. Look up the tests affected by a change to a specific file
tdad impact src/mymodule/utils.py
# → prints the list of affected tests

# 3. Example SKILL.md for the agent (key point: keep it within 20 lines)
#    Written here with a heredoc so the snippet runs as plain shell.
cat > SKILL.md <<'EOF'
## Test Verification Skill
1. Fix the bug in the relevant source file.
2. Run: grep '<changed_file_stem>' test_map.txt
   to find related test files.
3. Run the identified tests with pytest.
4. If any tests fail, revise the patch and re-verify.
5. Only submit when all identified tests pass.
EOF

# 4. Using test_map.txt (the agent runs this at runtime)
# grep 'utils' test_map.txt
# → tests/test_utils.py
# → tests/integration/test_pipeline.py
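
The grep step in the snippet above can also be done from Python, for example inside a harness that drives the agent. The sketch below assumes a hypothetical `source -> test` line format for test_map.txt, since the real file layout is not shown here; the `SEPARATOR` constant and the `affected_tests` helper are introduced purely for illustration.

```python
from pathlib import Path

# Hypothetical line format "source_path -> test_path"; not confirmed as TDAD's real format.
SEPARATOR = " -> "


def affected_tests(changed_file: str, test_map_text: str) -> list[str]:
    """Return the test files mapped to the source file with the same stem."""
    stem = Path(changed_file).stem
    tests: list[str] = []
    for line in test_map_text.splitlines():
        source, sep, test = line.partition(SEPARATOR)
        test = test.strip()
        # Skip malformed lines, non-matching sources, and duplicates.
        if sep and Path(source.strip()).stem == stem and test not in tests:
            tests.append(test)
    return tests


# Illustrative file contents using the assumed format.
sample = """\
src/mymodule/utils.py -> tests/test_utils.py
src/mymodule/utils.py -> tests/integration/test_pipeline.py
src/mymodule/io.py -> tests/test_io.py
"""

selected = affected_tests("src/mymodule/utils.py", sample)
# pytest accepts test file paths as positional arguments.
command = ["pytest", *selected]
print(" ".join(command))
```

Matching on the source side of each line, rather than anywhere in the line, avoids selecting a test merely because its own path happens to contain the stem.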

Terminology

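regression: When adding new code or fixing a bug suddenly breaks a feature that used to work. Like repairing one part of a house and springing a leak somewhere else.

P2P (PASS_TO_PASS): Existing tests that must pass both before and after the patch is applied. If one of these breaks, a regression has occurred.

impact analysis: Tracing in advance which parts of a system are affected when a line of code changes. Like predicting how far the dominoes will fall before you tip the first one.

AST (Abstract Syntax Tree): A tree-structured representation of code. It is the internal representation that compilers and analysis tools use to understand code; think of it as the code's blueprint.

RTS (Regression Test Selection): A technique that selects and runs only the tests related to the changed code. Running the relevant tests instead of the full suite saves time.

GraphRAG: A retrieval approach that uses graph structure instead of plain vector search to find more accurate information. Because it understands relationships, it returns more relevant results than simple keyword search.

context window: The maximum length of text an LLM can process at once. Beyond this limit, earlier content is forgotten. Like the limit on how many characters fit on a single page of notes.

SWE-bench Verified: A benchmark that measures how well AI agents resolve real GitHub issues. It is a set of 500 human-validated bug-fix tasks drawn from Python projects.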
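
To make the impact analysis and RTS entries concrete, the sketch below ranks tests by how close they sit to a changed file in a reverse-dependency graph. The abstract mentions "weighted impact analysis" but this page does not describe the weighting, so the per-hop `decay` factor, the graph contents, and the function name are all assumptions for illustration only.

```python
from collections import deque


def impact_scores(
    dependents: dict[str, list[str]], changed: str, decay: float = 0.5
) -> dict[str, float]:
    """Score every node reachable from `changed`; closer dependents score higher."""
    scores = {changed: 1.0}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            candidate = scores[node] * decay
            # Keep the best (shortest-path) score; this also stops cycles.
            if candidate > scores.get(dependent, 0.0):
                scores[dependent] = candidate
                queue.append(dependent)
    del scores[changed]
    return scores


# Hypothetical reverse-dependency graph: file -> files that depend on it.
dependents = {
    "mymodule/utils.py": ["mymodule/pipeline.py", "tests/test_utils.py"],
    "mymodule/pipeline.py": ["tests/integration/test_pipeline.py"],
}

scores = impact_scores(dependents, "mymodule/utils.py")
ranked_tests = sorted(
    ((name, score) for name, score in scores.items() if name.startswith("tests/")),
    key=lambda item: -item[1],
)
for name, score in ranked_tests:
    print(f"{score:.2f}  {name}")
```

A test that imports the changed file directly scores 0.5, while one that reaches it through an intermediate module scores 0.25. An agent with a limited budget can then run the highest-scoring tests first.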

Original Abstract

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.