The Kitchen Loop: Building a Self-Evolving Codebase with User-Spec-Driven Development

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Mar 26, 2026 • Yannick Roy

TL;DR Highlight

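An autonomous software-evolution framework in which an LLM agent exercises the product's specification as a user at roughly 1,000x human cadence, finds bugs, and auto-merges the resulting PRs.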

Who Should Read

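Dev team leads and backend engineers who want to bring AI coding agents into production, especially those looking for a structural fix for the quality decay and regressions that come with AI-generated code.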

Core Mechanics

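'As a User × 1000' (AaU1000): an LLM agent exercises the specification surface the way a real user would, reaching roughly 24 to 48x the PR throughput of a human engineer.
A three-tier scenario strategy, T1 Foundation (30%) → T2 Composition (50%) → T3 Frontier (20%), finds more bugs than random testing. The Composition tier grows superlinearly as the feature count increases.
Unit tests written by the same agent that wrote the code cannot be trusted: in one real case, all 38 unit tests passed while the core feature was completely broken. Only L3/L4 E2E tests and the UAT Gate provide real quality assurance.
In the Multi-Model Tribunal, three models (Gemini, Codex (GPT), and Claude) review each PR independently, guarding against any single model's misjudgment. No model's output is accepted as-is.
Regression Oracle + Drift Control + automated Pause Gates yielded zero regressions across 285+ iterations and 1,094+ merged PRs. Quality-gate pass rates rose monotonically from 76~91% to 100%.
A fixed cost of about $350 per month (Claude Code Max $200 + Codex $20 + Gemini $20 + CodeRabbit $15 + CI $50~100) runs two production systems at once, at about $0.38 per PR.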
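
To see why the Composition tier grows superlinearly, it helps to count scenarios. The sketch below assumes, purely for illustration, that a T1 scenario exercises one feature and a T2 scenario combines exactly two. The paper may count combinations differently, and `scenario_counts` is a hypothetical helper.

```python
from math import comb


def scenario_counts(n_features: int) -> tuple[int, int]:
    """Return (single-feature scenarios, pairwise combinations)."""
    # Assumption: T1 = one scenario per feature, T2 = one per unordered pair.
    return n_features, comb(n_features, 2)


for n in (5, 10, 20, 40):
    singles, pairs = scenario_counts(n)
    print(f"{n:>3} features: {singles:>3} T1 scenarios, {pairs:>4} T2 pairs")
```

Under this assumption, doubling the feature count from 20 to 40 doubles the T1 scenarios but roughly quadruples the pairwise ones (190 to 780).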

Evidence

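DeFi SDK: 728+ PRs merged over 122+ iterations, 10,913 unit tests (up from 6,400), 62 demo strategies (up from 13), zero regressions.
Signal Platform: 366 PRs merged over 163 iterations (97% merge rate), L1/L2/L3 pass rates all rising from 76~91% to 100%, and zero Tier 1 canary escapes across all 163 iterations.
About $0.38 per PR versus $600~1,000 per PR for a senior engineer, roughly 1,800x cheaper; monthly output of 600+ PRs versus 15~25 for a human.
Adopting Cursor AI has been associated with +30% static-analysis warnings and +42% code complexity (He et al. 2025); the Kitchen Loop's verification layers block this class of degradation structurally.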
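
The "roughly 1,800x" figure can be sanity-checked against the per-PR costs quoted above. This is plain arithmetic on those numbers, not the paper's own calculation, and the variable names are illustrative.

```python
# Per-PR costs quoted above (USD).
loop_cost_per_pr = 0.38
engineer_cost_per_pr = (600, 1_000)  # low and high estimates

low_ratio, high_ratio = (c / loop_cost_per_pr for c in engineer_cost_per_pr)
print(f"cost ratio: {low_ratio:,.0f}x to {high_ratio:,.0f}x")
```

The quoted 1,800x sits inside the resulting range of about 1,579x to 2,632x.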

How to Apply

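Lay out your product spec as an 'N features × M platforms × K actions' matrix and rank the empty cells by priority (P0~P3). This becomes the input to the Loop's Ideation phase.
Add L3/L4 verification on top of existing unit tests: for a web app, real browser automation with Playwright; for a backend, real API calls with before/after state comparisons (State Delta) as assertions.
Introduce a 'sealed test card' pattern before merging a PR: a different, weaker model than the implementing agent (for example, Haiku) runs the tests from the card alone, with no other context, which counters happy-path bias and cheating.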
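
The first step can be sketched as a small script. The feature, platform, and action names, the `covered` set, and the priority rule are all made-up placeholders; only the N × M × K shape and the P0~P3 labels come from the text.

```python
from itertools import product

features = ["swap", "lend"]            # hypothetical
platforms = ["ethereum", "arbitrum"]   # hypothetical
actions = ["open", "close"]            # hypothetical

# Cells that already have a passing end-to-end scenario (hypothetical).
covered = {("swap", "ethereum", "open"), ("swap", "ethereum", "close")}


def priority(cell: tuple[str, str, str]) -> str:
    """Toy rule: gaps on a platform with existing coverage rank higher."""
    _, platform, _ = cell
    return "P0" if any(c[1] == platform for c in covered) else "P2"


# Every cell of the matrix is one claim; uncovered cells are the gaps.
matrix = list(product(features, platforms, actions))
gaps = sorted((priority(c), c) for c in matrix if c not in covered)

for prio, cell in gaps:
    print(prio, *cell)
```

A real priority rule would more likely weigh user impact or usage data. The point of the sketch is only that the gap list falls out of the matrix mechanically once coverage is recorded per cell.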

Code Example


# Example of the core Kitchen Loop UAT Gate patterns

## 1. Sealed Test Card format (written by the implementing agent)
```
## Test Card: Backtest Service API

Step 1:
  command: curl -X POST http://localhost:8000/api/v1/backtest -H 'Content-Type: application/json' -d '{"strategy": "simple_ma"}'
  expected_exit_code: 0
  expected_output_contains: '"job_id"'

Step 2 (poll for completion):
  command: curl http://localhost:8000/api/v1/backtest/{job_id}
  expected_exit_code: 0
  expected_output_contains: '"status": "completed"'
  expected_output_contains: '"pnl"'

Step 3 (verify rejection of bad input):
  command: curl -X POST http://localhost:8000/api/v1/backtest -H 'Content-Type: application/json' -d '{"strategy": ""}'
  expected_exit_code: 0
  expected_output_contains: '"error"'
```
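
A harness could check one step of such a card mechanically. The sketch below is an assumption about how that might look, not the paper's tooling: `Step` and `run_step` are hypothetical names, and the demo step runs the local Python interpreter instead of `curl` so that it works without a server.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Step:
    command: str
    expected_exit_code: int = 0
    expected_output_contains: list[str] = field(default_factory=list)


def run_step(step: Step) -> bool:
    """Run one step and compare exit code and stdout to expectations."""
    proc = subprocess.run(
        shlex.split(step.command), capture_output=True, text=True
    )
    if proc.returncode != step.expected_exit_code:
        return False
    return all(s in proc.stdout for s in step.expected_output_contains)


# Stand-in for the curl call in Step 1: prints something containing "job_id".
demo = Step(
    command=shlex.join([sys.executable, "-c", "print('job_id: 42')"]),
    expected_output_contains=["job_id"],
)
print("PASS" if run_step(demo) else "PRODUCT_FAIL")
```

Substituting values such as `{job_id}` between steps and polling until completion are left out of this sketch.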

## 2. Fresh Evaluator run (a weak model such as Haiku, zero context)
```python
import subprocess
from pathlib import Path


def run_uat_gate(test_card_path: str, worktree_path: str) -> dict:
    """
    - Information wall: evaluator gets ONLY the test card
    - Weak model: use cheapest/weakest available
    - Read-only: cannot modify product files

    NOTE: run_agent() and parse_verdict() are harness helpers that are not
    defined in this snippet; their signatures here are assumptions.
    """
    test_card = Path(test_card_path).read_text(encoding="utf-8")
    evaluator_prompt = f"""
    You are a fresh user with zero context about the implementation.
    Execute EVERY step in this test card exactly as written.
    Do NOT modify any product files.
    Report exact command output and exit codes.

    Test Card:
    {test_card}
    """

    result = run_agent(
        model="claude-haiku",  # weakest available
        prompt=evaluator_prompt,
        working_dir=worktree_path,
        read_only=True,
    )

    # Anti-cheat: verify no product files were modified.
    # `git status --porcelain` also reports staged and untracked files,
    # which `git diff --name-only` alone would miss.
    changed = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=worktree_path, capture_output=True, text=True, check=True,
    ).stdout.strip()

    if changed:
        return {"verdict": "EVAL_CHEAT_FAIL", "modified_files": changed}

    return parse_verdict(result)


# Possible verdicts:
# PASS → proceed to merge the PR
# PRODUCT_FAIL → keep the ticket open, tag the PR uat-failed
# UAT_SPEC_FAIL → the test card itself is unclear
# EVAL_CHEAT_FAIL → evaluator tried to modify files (serious problem)
```

## 3. Three-tier scenario distribution prompt
```
System: You are the Kitchen Loop ideation agent.
Select the next test scenario using this distribution:
- 30% Foundation (T1): Single feature, happy path, must always work
- 50% Composition (T2): Combine 2+ features, find seam bugs  
- 20% Frontier (T3): Reach beyond current capabilities, produce gap analysis

Current spec surface: {spec_matrix}
Blocked combos (known broken, skip): {blocked_combos}
Recent iterations: {last_5_scenarios}

Select ONE scenario and document:
1. Usage scenario (what a real user would try)
2. Expected behavior
3. How to verify it worked (state deltas)
```
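
If the tier were chosen in code instead of by the prompt, the 30/50/20 split could be drawn with a weighted sample. This is an illustrative alternative, not something the source describes, and `pick_tier` is a hypothetical helper.

```python
import random
from collections import Counter

TIER_WEIGHTS = {"T1 Foundation": 30, "T2 Composition": 50, "T3 Frontier": 20}


def pick_tier(rng: random.Random) -> str:
    """Draw one tier according to the 30/50/20 distribution."""
    tiers = list(TIER_WEIGHTS)
    weights = list(TIER_WEIGHTS.values())
    return rng.choices(tiers, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
counts = Counter(pick_tier(rng) for _ in range(10_000))
for tier, n in counts.most_common():
    print(f"{tier}: {n / 10_000:.1%}")
```

Sampling in code makes the split auditable across iterations, whereas asking the model to follow percentages relies on it tracking its own past choices.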

Terminology

Specification Surface: The full set of features the product claims to support. It is expressed as a 'feature × platform × action type' matrix, and each cell of that matrix is one claim to be tested.

Regression Oracle: A test suite that automatically decides, on every iteration, whether the system has become worse than before. It is an automated judge, not human QA.

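UAT Gate: A stage in which a completely different, context-free, weaker model (not the implementing agent) verifies the feature from the user's point of view. It prevents the 'grading your own exam' problem.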

Coverage-Exhaustion Mode: An operating mode whose goal is not to resolve a single issue but to test every combination in the spec matrix systematically until no coverage gaps remain.

Drift Control: Monitoring the trend of quality metrics across many iterations. Because it watches trends and not individual test passes or failures, it catches quiet quality degradation early.
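
One way to turn a trend into a pause decision is to compare a recent window of pass rates with the window before it. The window size, the tolerance, and the `should_pause` helper are assumptions chosen for illustration; the source does not spell out the paper's actual rule.

```python
from statistics import mean


def should_pause(pass_rates: list[float], window: int = 3,
                 tolerance: float = 0.02) -> bool:
    """Pause when the recent average drops below the previous window's."""
    if len(pass_rates) < 2 * window:
        return False  # not enough history to judge a trend
    previous = mean(pass_rates[-2 * window:-window])
    recent = mean(pass_rates[-window:])
    return recent < previous - tolerance


healthy = [0.91, 0.93, 0.95, 0.97, 0.99, 1.00]
drifting = [0.99, 1.00, 0.99, 0.97, 0.95, 0.94]
print(should_pause(healthy), should_pause(drifting))
```

Every value in `drifting` is still 0.94 or higher, so a fixed per-iteration threshold of, say, 0.90 would never fire. Only the window comparison flags the decline.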

Anti-Signal Canary: Deliberately crafted 'bad inputs' fed in alongside real inputs to verify that the quality gates actually work. It is a fire drill for the alarm system.

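Multi-Model Tribunal: A setup in which three different AI models (Gemini, Codex, Claude) independently review the same code or artifact and decide by majority vote. It reduces the bias and misjudgment of any single model.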
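
Majority voting over three independent reviews can be sketched in a few lines. The verdict strings and the `tribunal_verdict` helper are hypothetical; real reviews would come from separate model calls, which are stubbed out here as a plain dictionary.

```python
from collections import Counter


def tribunal_verdict(reviews: dict[str, str]) -> str:
    """Return the verdict held by a strict majority, else 'ESCALATE'."""
    verdict, votes = Counter(reviews.values()).most_common(1)[0]
    return verdict if votes > len(reviews) / 2 else "ESCALATE"


reviews = {"gemini": "APPROVE", "codex": "REJECT", "claude": "APPROVE"}
print(tribunal_verdict(reviews))
```

Falling back to an explicit `ESCALATE` outcome when no verdict has a strict majority is a design choice of this sketch, so that a three-way split is never silently resolved.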

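State Delta: A method that measures the real state change before and after an action (for example, a change in wallet balance) to verify that the code actually produced the intended result. 'Executed successfully' and 'produced the right result' are different things.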
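
A state-delta assertion compares snapshots taken before and after the action, not just the action's return value. The in-memory `balances` dictionary and the deliberately buggy `transfer` function below are invented for the example.

```python
def transfer(balances: dict[str, int], src: str, dst: str, amount: int) -> bool:
    """Buggy on purpose: debits the sender but never credits the receiver."""
    balances[src] -= amount
    return True  # reports success regardless


def state_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-account change between two snapshots."""
    return {k: after[k] - before[k] for k in before}


balances = {"alice": 100, "bob": 0}
before = dict(balances)  # snapshot before the action
ok = transfer(balances, "alice", "bob", 30)
delta = state_delta(before, balances)

expected = {"alice": -30, "bob": 30}
print("call succeeded:", ok)
print("delta matches:", delta == expected, delta)
```

The call reports success, yet the delta shows that `bob` never received the funds. A test that only checked the return value would pass; the state-delta check fails.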


Original Abstract

Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop, a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) 'As a User x 1000', where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.