The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase
TL;DR Highlight
An autonomous software evolution framework in which LLM agents exercise the product's specification surface at 1,000x human cadence to find bugs and auto-merge PRs
Who Should Read
Development team leads or backend engineers looking to adopt AI coding agents in production — especially those who want to structurally address code quality degradation and regression problems caused by AI-generated code.
Core Mechanics
- Achieves ~24–48x the PR throughput of human engineers via the 'As a User × 1000' (AaU1000) approach, in which an LLM agent exercises the specification surface the way real users would
- T1 Foundation (30%) → T2 Composition (50%) → T3 Frontier (20%) three-tier scenario strategy finds more bugs than random testing; the Composition tier grows super-linearly as feature count increases
- Unit tests written by the same agent that wrote the code cannot be trusted — in one observed case, all 38 unit tests passed while a core feature was completely broken; only L3/L4 E2E tests and the UAT Gate can guarantee true quality
- Multi-Model Tribunal with Gemini + Codex (GPT) + Claude independently reviewing PRs prevents any single model's misjudgment — no model's output is accepted as-is
- Regression Oracle + Drift Control + automated Pause Gate achieved zero regression bugs across 285+ iterations and 1,094+ merged PRs; quality gate pass rate monotonically increased from 76–91% → 100%
- Running two production systems simultaneously at ~$350/month fixed cost (Claude Code Max $200 + Codex $20 + Gemini $20 + CodeRabbit $15 + CI $50–100); ~$0.38 cost per PR
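The T1/T2/T3 scenario split above can be sketched as a simple budget allocator. This is a hypothetical illustration, not code from the paper: `plan_iteration` and the tier labels are stand-in names, with the 30/50/20 split taken from the article.

```python
# Hypothetical sketch: allocate one iteration's scenario budget across the
# three tiers named in the article. The 30/50/20 split is from the text;
# the function and tier labels are illustrative stand-ins.
TIER_MIX = {"T1-foundation": 0.30, "T2-composition": 0.50, "T3-frontier": 0.20}

def plan_iteration(n_scenarios: int) -> dict[str, int]:
    """Deterministically split a scenario budget by tier."""
    counts = {tier: int(n_scenarios * share) for tier, share in TIER_MIX.items()}
    # Give any rounding remainder to the composition tier, which the article
    # says grows super-linearly as feature count increases.
    counts["T2-composition"] += n_scenarios - sum(counts.values())
    return counts

print(plan_iteration(62))  # 62 scenarios, as in the DeFi SDK evidence
```

In a real loop the per-tier counts would feed the scenario generator for that iteration; here they simply show the fixed mix.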
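The Multi-Model Tribunal's unanimity rule can be sketched as follows. The reviewer callables here are toy stand-ins for real Gemini/Codex/Claude review calls; only the "no single model's verdict is accepted as-is" structure comes from the article.

```python
from typing import Callable

# Hypothetical sketch of the Multi-Model Tribunal: a PR merges only when every
# independent reviewer approves, so no single model's misjudgment decides the
# outcome. The `Review` callable and reviewer names are illustrative stand-ins.
Review = Callable[[str], bool]  # takes a PR diff, returns approve/reject

def tribunal_verdict(diff: str, reviewers: dict[str, Review]) -> tuple[bool, list[str]]:
    """Require unanimous approval; return (mergeable, dissenting reviewers)."""
    dissent = [name for name, review in reviewers.items() if not review(diff)]
    return (not dissent, dissent)

# Stand-in reviewers: each one just scans the diff for a different red flag.
reviewers = {"gemini": lambda d: "password" not in d,
             "codex": lambda d: len(d) > 0,
             "claude": lambda d: "TODO" not in d}
print(tribunal_verdict("fix: correct fee rounding", reviewers))  # (True, [])
```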
Evidence
- DeFi SDK: 728+ PRs merged over 122+ iterations, 10,913 unit tests (from initial 6,400), 62 demo scenarios (from initial 13), zero regression bugs
- Signal Platform: 366 PRs merged over 163 iterations (97% merge rate), L1/L2/L3 pass rates all improved from 76–91% → 100%, zero Tier 1 canary escapes across all 163 iterations
- ~$0.38 cost per PR vs. $600–1,000 per PR for a senior engineer — approximately 1,800x cheaper; monthly PR output 600+ vs. 15–25 for humans
- The +30% increase in static-analysis warnings and +42% increase in code complexity that Cursor AI introduced (He et al. 2025) are structurally blocked by the Kitchen Loop's validation layers
How to Apply
- Organize your product spec as an 'N features × M platforms × K actions' matrix and classify empty cells by priority (P0–P3). This becomes the input to the Loop's Ideation stage.
- Add L3/L4 validation on top of your existing unit tests — for web apps, use Playwright for real browser automation; for backends, add actual API calls with before/after state comparison (State Delta) as assertions.
- Introduce the 'sealed test card' pattern before merging PRs — have a weaker model (e.g., Haiku), different from the implementing agent, run the tests with only the card and zero context, preventing happy-path bias and cheating.
Code Example
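The spec-matrix step above can be sketched as follows. This is a hypothetical illustration: the feature, platform, and action names are invented for the example, and in practice the empty cells would carry P0–P3 priorities before entering the Ideation stage.

```python
from itertools import product

# Hypothetical sketch of the 'N features x M platforms x K actions' spec
# matrix. Feature/platform/action names are illustrative, not from the paper.
FEATURES = ["swap", "stake"]
PLATFORMS = ["web", "mobile"]
ACTIONS = ["create", "cancel"]

# Cells the spec surface already covers, with an assigned priority (P0-P3).
covered = {("swap", "web", "create"): "P0", ("stake", "web", "create"): "P1"}

def empty_cells() -> list[tuple[str, str, str]]:
    """Enumerate uncovered spec cells — the input to the Loop's Ideation stage."""
    return [cell for cell in product(FEATURES, PLATFORMS, ACTIONS)
            if cell not in covered]

print(len(empty_cells()))  # 2 x 2 x 2 = 8 cells, 2 covered, so 6 remain
```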
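The State Delta assertion for backend L3/L4 validation can be sketched as a pure comparison helper. In a real setup the before/after snapshots would come from actual API calls; the helper names and the deposit example below are assumptions for illustration.

```python
# Hypothetical sketch of a State Delta assertion: instead of trusting the
# author's own unit tests, snapshot observable state before and after an
# action and assert on exactly what changed. The snapshots here are inline
# dicts standing in for real API responses.

def state_delta(before: dict, after: dict) -> dict:
    """Map each changed key to its (old, new) pair."""
    keys = before.keys() | after.keys()
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

def assert_delta(before: dict, after: dict, expected: dict) -> None:
    delta = state_delta(before, after)
    assert delta == expected, f"unexpected state delta: {delta}"

# Example: a deposit of 100 must change the balance and nothing else.
assert_delta({"balance": 0, "nonce": 7},
             {"balance": 100, "nonce": 7},
             {"balance": (0, 100)})
```

Asserting on the whole delta, not just the target field, is what catches unintended side effects a happy-path unit test would miss.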
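The sealed test card pattern can be sketched as a card that the implementing agent emits and a separate zero-context runner that executes it literally. The card schema, registry, and `add` entry point are hypothetical; in practice the runner would be a different, weaker model with no access to the implementation or its chat context.

```python
import json

# Hypothetical sketch of a 'sealed test card': the implementing agent emits a
# self-contained card, and a separate runner executes it with zero context —
# it sees only the card and a registry of callable entry points, never the
# implementation's own tests, so happy-path bias and cheating can't leak in.
CARD = json.dumps({
    "steps": [
        {"call": "add", "args": [2, 3], "expect": 5},
        {"call": "add", "args": [-1, 1], "expect": 0},
    ]
})

# Registry of the product's public entry points (a stand-in function here).
REGISTRY = {"add": lambda a, b: a + b}

def run_sealed_card(card_json: str) -> bool:
    """Execute every step literally and require every expectation to hold."""
    card = json.loads(card_json)
    return all(REGISTRY[s["call"]](*s["args"]) == s["expect"]
               for s in card["steps"])

print(run_sealed_card(CARD))  # True: both steps match their expectations
```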
Original Abstract
Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop, a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) 'As a User x 1000', where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.