A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents | AI Paper Digest

TL;DR Highlight

A practical guide that names the reasons LLM agents fail in production and lays out, in five steps, which architecture pattern to use when.

Who Should Read

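Backend and ML engineers who are deploying LLM agents to production and running into the problem that "the model is fine, but the system behaves strangely." Teams using frameworks such as AutoGen, LangChain, or CrewAI that lack clear criteria for architecture decisions.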

Core Mechanics

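71% of LLM agent failures stem not from the model but from a weak boundary where LLM output becomes a real system action (the Stochastic-Deterministic Boundary, SDB), according to a classification of 21 incident postmortems.
The SDB has four parts: proposer (the LLM output), verifier (deterministic validation), commit (the durable write), and reject signal (the response returned to the LLM on failure). In one case, prompt-injection resistance fell 23 points, from 94% to 71%, after a GPT-4o→GPT-4.1 upgrade, and that too was because there was no verifier.
In the agent reliability formula y(t) = μt + σξ(t), σ (per-call model variance) shrinks with each model generation, while μ (architectural momentum) is determined only by pattern choice. In other words, the better models get, the more the architecture matters.
The paper presents three concerns (Coordination, State, Control) and a catalog of six patterns: P1 Hierarchical Delegation, P2 Scatter-Gather+Saga, P3 Event-Driven Sequencing, P4 Supervisor+Gate, P5 Shared State Machine, P6 Human in the Loop.
It names a new failure mode, "Replay Divergence": in a system built on an append-only event log (P3), a change of model version makes re-reading the same events produce different results. Detecting it is the trigger for a P3→P5 migration.
The output of the five-step selection methodology is a six-line Architecture Decision Record (ADR): Runtime classification → Spine selection → Coordination → Control → build order. The core principle is "dashboard first, agent later."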
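
The four-part SDB contract described above can be sketched roughly as follows. This is a minimal illustration, not the paper's reference implementation: the names `Proposal`, `Verdict`, `verify_refund`, `run_sdb`, and the refund policy limit are all hypothetical, and the commit step is just an in-memory list standing in for a durable write.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    """What the proposer (the LLM) asks the system to do. Hypothetical shape."""
    action: str
    amount: float


@dataclass(frozen=True)
class Verdict:
    ok: bool
    reason: str = ""


def verify_refund(p: Proposal, max_refund: float = 100.0) -> Verdict:
    """Deterministic verifier: schema and policy checks only, no LLM call."""
    if p.action != "refund":
        return Verdict(False, f"unknown action: {p.action}")
    if not (0 < p.amount <= max_refund):
        return Verdict(False, f"amount {p.amount} outside (0, {max_refund}]")
    return Verdict(True)


def run_sdb(
    proposal: Proposal,
    verifier: Callable[[Proposal], Verdict],
    commit: Callable[[Proposal], None],
) -> dict:
    """Verify first; commit only on success; otherwise return a reject signal."""
    verdict = verifier(proposal)
    if not verdict.ok:
        # Reject signal: an explicit failure the model can read, never "completed".
        return {"status": "rejected", "reason": verdict.reason}
    commit(proposal)
    return {"status": "committed"}


ledger: list[Proposal] = []  # stand-in for a durable store
accepted = run_sdb(Proposal("refund", 40.0), verify_refund, ledger.append)
refused = run_sdb(Proposal("refund", 900.0), verify_refund, ledger.append)
```

The point of the shape is that nothing reaches `commit` without passing the verifier, and a refusal is reported with its own status rather than being folded into a success response.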

Evidence

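An audit of 21 LLM-to-action call sites across five open-source frameworks (openai/swarm, AutoGPT, LangChain, CrewAI, AutoGen) found verifier+commit logic at 19 of them (90%).
Classification of 21 agent incident postmortems: 15 (71.4%) were SDB weaknesses, and 17 (81%) of the fixes strengthened one of the four SDB parts.
On the GPT-4o→GPT-4.1 upgrade, prompt-injection resistance in the same evaluation environment dropped 23 points, from 94% to 71% (Promptfoo case); the fix added an output classifier and strict tool gating.
openai/openai-agents-js #1104: a rejected tool call returned status:'completed', so the model mistook it for success and hallucinated. This is a real-world bug showing that the reject signal is part of the core contract.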

How to Apply

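When designing a new agent system, before writing the first line of code, find each point where LLM output becomes a real action and explicitly define the four SDB parts (proposer/verifier/commit/reject). The verifier must be a deterministic rule (schema check, policy code) and must not be an LLM call.
For a Long-Horizon agent (workflows such as 90-day contract renewal or number porting), if results change after a model upgrade, apply the Replay Divergence diagnostic: re-run the failed batch on the previous model version → if that resolves it, consider a P3→P5 migration (from event-log-based to state-machine-based).
Reference pattern combinations by workflow type: for a real-time support tool (Conversational), P1+P4 is enough; for a batch scanner (Autonomous), use P3+P2+P4; for Long-Horizon work with regulatory or legal consequences, use the full P5+P1+P2+P4+P6 combination.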
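
The Replay Divergence check above amounts to replaying the same failed batch through two model versions and comparing the outputs. The sketch below assumes that framing; `consumer_v1` and `consumer_v2` are hypothetical deterministic stand-ins for LLM consumers pinned to an older and a newer model version, and the event strings are made up.

```python
from typing import Callable, Sequence

Consumer = Callable[[str], str]


def consumer_v1(event: str) -> str:
    """Stand-in for the consumer on the previous model version."""
    return "renew" if "renewal" in event else "ignore"


def consumer_v2(event: str) -> str:
    """Stand-in for the consumer on the upgraded model version.

    It reads the same events more narrowly, which simulates divergence.
    """
    return "renew" if event.startswith("renewal") else "ignore"


def replay(events: Sequence[str], consumer: Consumer) -> list[str]:
    """Feed the unchanged event log through one consumer."""
    return [consumer(e) for e in events]


def diverging_events(
    events: Sequence[str], old: Consumer, new: Consumer
) -> list[tuple[str, str, str]]:
    """Return (event, old_output, new_output) wherever the two replays differ."""
    old_out = replay(events, old)
    new_out = replay(events, new)
    return [(e, a, b) for e, a, b in zip(events, old_out, new_out) if a != b]


failed_batch = [
    "renewal due in 30 days",
    "contract renewal requested",
    "price list updated",
]
diffs = diverging_events(failed_batch, consumer_v1, consumer_v2)
```

A non-empty `diffs` on an unchanged log points at the consumer rather than the data, which is the condition the digest gives for reviewing a P3→P5 migration.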

Code Example

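# Five-step selection methodology: execution checklist

## Step 1: Runtime classification
# Q: How long does one unit of work last? Does the outside world change midway?
# → seconds: Conversational | minutes: Autonomous | hours to days: Long-Horizon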

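## Step 2: Spine selection (P3 vs P5)
# Example inputs for one workflow (illustrative values; set them for your own system).
workflow_has_pauses_over_1_hour = True
has_external_waits = True
state_reconstructible_from_original_input = False
world_can_change_during_pause = True  # prices, policies, product EOL, etc.

# Choose P5 (Shared State Machine) when all three predicates below are true:
predicate_1 = workflow_has_pauses_over_1_hour or has_external_waits
predicate_2 = not state_reconstructible_from_original_input
predicate_3 = world_can_change_during_pause

if predicate_1 and predicate_2 and predicate_3:
    spine = "P5_SharedStateMachine"  # CAS-based durable row
elif not predicate_1:
    spine = "P3_EventDrivenSequencing"  # append-only log
else:
    spine = "None"  # reconstructible state, short session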

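## Step 3: Coordination selection
# P1 (Hierarchical Delegation): single owner, independent subtasks, deterministic merge possible
# P2 (Scatter-Gather+Saga): side effects on external systems, some failures tolerated, partial writes are costly

## Step 4: Control selection
# P4 (Supervisor+Gate): always include when there are side effects on external systems
# P6 (Human in the Loop): legal or financial consequences, cases outside policy scope, audit required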

## Step 5: Build order
build_order = [
    "1. State schema + observability dashboard (first!)",
    "2. Gate(P4) + audit log",
    "3. Orchestrator(P1/P2) + 1 sub-agent",
    "4. Remaining sub-agents",
    "5. P6, in this order: kill_switch → escalation → approval → throttling",
]

## Example six-line ADR output
adr = {
    "runtime_class": "Long-Horizon",
    "spine": "P5 (predicates 1, 2, 3 all true)",
    "coordination": "P1+P2 (single owner + external side effects)",
    "control": "P4+P6 (side effects + legal consequences)",
    "sequence": "console-first",
    "date_model": "2026 Q2 / Claude Sonnet 4.6"
}

Terminology

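SDB (Stochastic-Deterministic Boundary): The boundary at which an output proposed by the LLM becomes a real system action (a DB write, an API call, and so on). Without validation logic at this boundary, hallucinations are executed as-is.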

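Architectural Momentum (μ): The reliability contribution of architectural design quality, which accumulates over time. However good the model is, a poor architecture keeps eroding reliability over the long run.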

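Replay Divergence: The phenomenon where re-running the same event log later produces different results because the model version has changed. A failure pattern specific to combining event sourcing with LLMs.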

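CAS (Compare-And-Swap): A conditional update that tells the DB "write only if the current version is the same as the version I read." A technique for preventing conflicts when several agents try to change the same state at once.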
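
As a rough sketch of that conditional update, the example below uses an in-memory versioned row. `StateStore` and the state names are hypothetical; a real P5 spine would use a durable database row, for instance an update conditioned on the version column.

```python
import threading
from dataclasses import dataclass


@dataclass
class Row:
    state: str
    version: int = 0


class StateStore:
    """In-memory stand-in for a durable row. Hypothetical, for illustration."""

    def __init__(self, initial: str) -> None:
        self._row = Row(initial)
        self._lock = threading.Lock()

    def read(self) -> tuple[str, int]:
        with self._lock:
            return self._row.state, self._row.version

    def compare_and_swap(self, expected_version: int, new_state: str) -> bool:
        """Write only if nobody has written since `expected_version` was read."""
        with self._lock:
            if self._row.version != expected_version:
                return False  # stale read: caller must re-read and retry
            self._row = Row(new_state, expected_version + 1)
            return True


store = StateStore("quote_sent")
_, seen = store.read()  # two agents both read version 0
first = store.compare_and_swap(seen, "renewal_accepted")
second = store.compare_and_swap(seen, "renewal_declined")
```

The second writer loses because its version is stale, so it has to re-read the row before deciding again instead of silently overwriting the first agent's result.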

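Saga: A pattern for multi-step distributed transactions in which, when a step fails midway, the already-completed steps are cancelled (compensated) in reverse order. Example: refunding the payment when shipping fails after payment has completed.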
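
A minimal sketch of the compensation order, assuming each step is given as a (name, action, compensation) triple. `run_saga` and the step names are hypothetical, and the actions are no-ops so the example only shows the control flow.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


def noop() -> None:
    pass


succeeded, saga_log = run_saga([
    ("charge", noop, noop),          # compensation would be a refund
    ("reserve_stock", noop, noop),   # compensation would release the stock
    ("ship", fail_shipping, noop),
])
```

The failed step itself is not compensated, only the steps that completed before it, and the most recent one is undone first.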

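Event Sourcing: An approach that stores only a log of events ("what happened") instead of storing the current state directly. Replaying the log from the beginning reconstructs the current state.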
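
Reconstructing state by replay can be sketched as a fold over the log. The event shapes and the `apply_event` transition below are hypothetical; the transition is deliberately pure and deterministic, which is exactly the property that no longer holds once the consumer of the log is an LLM.

```python
from functools import reduce

# Hypothetical events for a contract-renewal workflow.
events = [
    {"type": "quote_sent", "price": 120},
    {"type": "price_changed", "price": 135},
    {"type": "renewal_accepted"},
]


def apply_event(state: dict, event: dict) -> dict:
    """Pure transition: the same log always yields the same state."""
    new_state = dict(state)
    if event["type"] in ("quote_sent", "price_changed"):
        new_state["price"] = event["price"]
    new_state["status"] = event["type"]
    return new_state


# Replay from the beginning, starting from an empty state.
current = reduce(apply_event, events, {})
```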

ADR (Architecture Decision Record): A six-line document recording why this architecture was chosen. The official record to show when a teammate or auditor later asks "why was it built this way?"

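Human in the Loop (HITL): A pattern in which, under certain conditions, the agent does not decide automatically but asks a human for approval or review. It consists of four control planes: kill switch, escalation, approval, and throttling.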

Original Abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.