Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations

Mar 27, 2026 • Rui Liu

TL;DR Highlight

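When an expensive AI directs a cheap one, the pair can match the expensive model's standalone performance at far lower cost, but only when a real capability gap exists between the two.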

Who Should Read

Backend and ML engineers designing LLM-based coding agents or multi-agent pipelines, and developers who must optimize cost against performance and are deciding which model should play which role.

Core Mechanics

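Claude Sonnet 4.6 (manager) + GPT-5-mini (worker) reaches a 62% solve rate, on par with Sonnet alone (60%), while using roughly 4.5x fewer strong-model tokens (~30K → 6.6K).
Adding structure without a capability gap backfires: pairing GPT-5-mini with itself drops the solve rate from 44% to 42%.
A manager that only reviews adds +2pp; a manager that directly steers exploration and planning adds +11pp. Active direction, not passive sign-off, is what matters.
Giving the manager repository access hurts performance: it hallucinates that the issue is "already fixed". Restricting it to text input is what makes it reason at the right level of abstraction.
Current models are trained as do-everything solo agents, so they are weak at delegation and scoped execution.
Pipeline code has to stand in for organizational structure: phase transitions, round limits, and prompt-strategy switches should not be left to the models.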
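The last point, that code rather than the models should own the organizational structure, can be sketched as a small orchestrator. This is a minimal illustration under stated assumptions, not the paper's implementation: `manager` and `worker` are hypothetical callables, the prompt prefixes (`ANALYZE`, `EXPLORE`, and so on) and the `APPROVE` verdict are invented for the example, the names of phases 2 and 3 are inferred, and the round limit of 3 is an assumed value.

```python
from typing import Callable, List

# Hypothetical callables standing in for real model clients.
Manager = Callable[[str], str]  # text in, text out; never touches the repo
Worker = Callable[[str], str]   # has repo access; returns a report or a patch

MAX_EXPLORATION_TASKS = 3  # cap recommended in this summary
MAX_IMPL_ROUNDS = 3        # assumed limit; the source does not give a number


def run_pipeline(issue: str, manager: Manager, worker: Worker) -> str:
    """Drive every phase from code so neither model decides when to switch."""
    # Phase 1: the manager reads only the issue text and lists tasks.
    analysis = manager(f"ANALYZE\n{issue}")
    tasks = [
        line[len("TASK: "):]
        for line in analysis.splitlines()
        if line.startswith("TASK: ")
    ][:MAX_EXPLORATION_TASKS]

    # Phase 2 (assumed name): the worker explores and reports in plain text.
    reports: List[str] = [worker(f"EXPLORE\n{task}") for task in tasks]

    # Phase 3 (assumed name): the manager plans from the text reports only.
    plan = manager("PLAN\n" + "\n".join(reports))

    # Phase 4: the loop, not the model, enforces the round limit and
    # switches from the guided prompt to the strict prompt after round 1.
    patch, feedback = "", ""
    for round_no in range(1, MAX_IMPL_ROUNDS + 1):
        if round_no == 1:
            patch = worker(f"IMPLEMENT_GUIDED\n{plan}")
        else:
            patch = worker(f"IMPLEMENT_STRICT\n{feedback}")
        feedback = manager(f"REVIEW\n{patch}")
        if feedback.strip().startswith("APPROVE"):
            break
    return patch
```

Because the phase order and the retry budget live in `run_pipeline`, a model that drifts out of its role cannot skip exploration or loop indefinitely.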

Evidence

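MANAGERWORKER (Sonnet 4.6 manager + GPT-5-mini worker) 62% vs. Strong Direct (Sonnet alone) 60%; strong-model tokens fall from ~30K to 6.6K, a 78% reduction (200 instances)
Weak→Weak (GPT-5-mini as both manager and worker) 42% vs. Weak Direct (GPT-5-mini alone) 44%; adding structure costs 2pp, and total tokens are the worst of any configuration at 75K (50 instances)
Simple Loop (review only) 53% vs. Full Pipeline (exploration + planning + implementation) 62%; structured direction contributes a further +9pp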
The Guided-then-Strict strategy reaches 64% versus 52% for Strict-only (+12pp), and empty patches drop from 8 to 0 (50-instance ablation)
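The token saving in the first evidence item is easy to verify by hand. The short calculation below uses only the two figures reported above (~30K and 6.6K strong-model tokens); the variable names are illustrative.

```python
strong_direct_tokens = 30_000   # approximate figure reported for Sonnet alone
manager_worker_tokens = 6_600   # strong-model tokens in the manager/worker setup

# Fraction of strong-model tokens saved, and the equivalent ratio.
reduction = 1 - manager_worker_tokens / strong_direct_tokens
ratio = strong_direct_tokens / manager_worker_tokens

print(f"reduction: {reduction:.0%}")  # reduction: 78%
print(f"ratio: {ratio:.1f}x")         # ratio: 4.5x
```

A 78% reduction and a 4.5x ratio are the same saving expressed two ways.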

How to Apply

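In a cost-sensitive coding-agent pipeline, split the roles: the strong model handles only text-only analysis, planning, and review (no file access), while the cheap model handles actual code execution and file exploration.
Design the worker prompt strategy as "round 1 autonomous (guided), retries strictly directed (strict)": in round 1 the worker learns the actual code structure, and on failure the manager issues exact correction instructions.
Have the manager decompose the work into at most 3 exploration tasks at a time, and have the worker return only natural-language summary reports; this allows iterative exploration without blowing up the context window.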
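The third recommendation can be sketched as two small helpers: one that extracts at most three `TASK:` lines from the manager's output, and one that assembles the worker's text reports into a bounded context for the manager. Both function names and the per-report character budget are assumptions made for illustration; the source specifies only the three-task cap and the natural-language reports.

```python
from typing import List

MAX_TASKS = 3            # this summary recommends at most 3 tasks per pass
MAX_REPORT_CHARS = 1500  # hypothetical per-report budget, not from the source


def parse_tasks(manager_output: str, limit: int = MAX_TASKS) -> List[str]:
    """Extract 'TASK: ...' lines and keep at most `limit` of them."""
    tasks = []
    for line in manager_output.splitlines():
        line = line.strip()
        if line.startswith("TASK:"):
            task = line[len("TASK:"):].strip()
            if task:  # skip empty task lines
                tasks.append(task)
    return tasks[:limit]


def build_manager_context(tasks: List[str], reports: List[str]) -> str:
    """Pair each task with its truncated natural-language report."""
    sections = []
    for i, (task, report) in enumerate(zip(tasks, reports), start=1):
        summary = report.strip()[:MAX_REPORT_CHARS]
        sections.append(f"### Task {i}: {task}\n{summary}")
    return "\n\n".join(sections)
```

Truncating in code is a blunt instrument; it serves here only to show that the context bound is enforced by the pipeline rather than requested from the worker.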

Code Example

# Phase 1: Manager analyzes issue (text-only, no repo access)
ANALYSIS_PROMPT = """
You are a senior software engineer analyzing a GitHub issue.

## Issue
{problem_statement}

## Your Task
Identify 2-3 specific exploration tasks that a junior
engineer should perform to gather the information needed
to fix this issue.

Each task should be a focused investigation:
- "Find the X method in Y file and report its signature"
- "Check how Z handles the case when W is None"

Output each task on its own line starting with "TASK: "
"""

# Phase 4 Round 1: Guided (worker has autonomy)
IMPL_GUIDED_PROMPT = """
You are fixing a bug in '{repo}'.

## Analysis & Plan
{plan}

## Your Task
1. Read the relevant file(s) to understand current code structure.
2. Make the changes described in the plan.
3. Run 'git diff' to produce your patch.

Use your judgment on the exact implementation.
Do NOT modify test files. Keep changes minimal.
"""

# Phase 4 Round 2+: Strict (follow corrections exactly)
IMPL_STRICT_PROMPT = """
A senior engineer has reviewed your previous attempt.

## Corrected Instructions -- follow these EXACTLY
{prior_feedback}

## Rules
1. Apply ONLY the changes described above.
2. Do NOT modify test files.
3. Output the complete git diff as your final answer.
"""
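A minimal sketch of how the two implementation prompts might be selected per round. `build_worker_prompt` is a hypothetical helper, not part of the paper's code; it assumes the templates are filled with `str.format` using the placeholder names shown above (`repo`, `plan`, `prior_feedback`).

```python
def build_worker_prompt(
    round_no: int,
    repo: str,
    plan: str,
    prior_feedback: str,
    guided_template: str,
    strict_template: str,
) -> str:
    """Return the guided prompt for round 1 and the strict prompt afterwards."""
    if round_no < 1:
        raise ValueError("round_no starts at 1")
    if round_no == 1:
        # Round 1: the worker reads the code and uses its own judgment.
        return guided_template.format(repo=repo, plan=plan)
    # Round 2+: the worker applies the manager's corrections exactly.
    return strict_template.format(prior_feedback=prior_feedback)
```

Called as `build_worker_prompt(1, repo, plan, "", IMPL_GUIDED_PROMPT, IMPL_STRICT_PROMPT)`, this keeps the strategy switch in pipeline code instead of asking the worker to decide which mode it is in.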

Terminology

SWE-bench Lite: A benchmark that measures how well AI resolves real GitHub issues. It consists of tasks that require fixing bugs in real Python projects with a code patch.

monolithic agent: A single AI agent that handles analysis, planning, coding, and execution on its own. Most current coding agents, such as Claude Code and OpenHands, work this way.

training distribution: The range of patterns in the data a model saw during training. Models handle tasks inside this range well but tend to behave erratically outside it.

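delegation: Handing work to another person (or agent) instead of doing it yourself; the skill of giving instructions at the right level of abstraction, such as "find method Y in file X and report back".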

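scoped execution: The ability to act only within a given scope and report the result, carrying out exactly what was instructed without deliberating over the big picture. Current AI tends to keep drifting outside the scope.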

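Pareto frontier: The set of optimal cost/performance combinations in which neither side can be improved without sacrificing the other. Used here to analyze the trade-off between token cost and solve rate.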

ablation: An experiment that removes or simplifies a specific component to measure its contribution, e.g. "How much does performance drop without the structured exploration phase?"
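To make the Pareto frontier entry concrete, the sketch below filters the four configurations reported in the Evidence section down to the non-dominated ones. Several simplifications are mine rather than the paper's: cost is measured as strong-model tokens only, the GPT-5-mini-only configurations are assumed to use zero strong-model tokens, and results from the 200-instance and 50-instance runs are placed on the same axes.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    strong_tokens: float  # cost axis (lower is better)
    solve_rate: float     # performance axis (higher is better)


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates."""
    frontier = []
    for c in configs:
        dominated = any(
            o.strong_tokens <= c.strong_tokens
            and o.solve_rate >= c.solve_rate
            and (o.strong_tokens < c.strong_tokens or o.solve_rate > c.solve_rate)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier


configs = [
    Config("Strong Direct", 30_000, 0.60),
    Config("ManagerWorker", 6_600, 0.62),
    Config("Weak Direct", 0, 0.44),  # assumption: no strong-model tokens
    Config("Weak->Weak", 0, 0.42),   # assumption: no strong-model tokens
]

print([c.name for c in pareto_frontier(configs)])
# ['ManagerWorker', 'Weak Direct']
```

Under these assumptions, Strong Direct is dominated by ManagerWorker (more strong-model tokens, lower solve rate) and Weak→Weak is dominated by Weak Direct.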

Related Resources

MANAGERWORKER GitHub Repository (SWE-bench eval infrastructure)

Original Abstract

Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two-agent pipeline where an expensive "manager" model (text-only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap "worker" model (with full repo access) executes code changes. We evaluate on 200 instances from SWE-bench Lite across five configurations that vary the manager-worker relationship, pipeline complexity, and model pairing. Our findings reveal both the promise and the limits of multi-agent direction: (1) a strong manager directing a weak worker (62%) matches a strong single agent (60%) at a fraction of the strong-model token usage, showing that expensive reasoning can substitute for expensive execution; (2) a weak manager directing a weak worker (42%) performs worse than the weak agent alone (44%), demonstrating that the directing relationship requires a genuine capability gap--structure without substance is pure overhead; (3) the manager's value lies in directing, not merely reviewing--a minimal review-only loop adds just 2pp over the baseline, while structured exploration and planning add 11pp, showing that active direction is what makes the capability gap productive; and (4) these behaviors trace to a single root cause: current models are trained as monolithic agents, and splitting them into director/worker roles fights their training distribution. The pipeline succeeds by designing around this mismatch--keeping each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code. This diagnosis points to concrete training gaps: delegation, scoped execution, and mode switching are skills absent from current training data.