ATROPOS: Early Termination과 Model Hotswap으로 LLM 기반 에이전트의 비용-성능 트레이드오프 개선

Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap

Apr 16, 2026•Naryeong Kim, Shin Yoo•View PDF

TL;DR Highlight

SLM으로 시작해서 실패 예측되면 GPT-4로 갈아타는 방식으로, GPT-4o 성능의 74%를 비용 23.9%만 쓰고 달성하는 에이전트 최적화 기법

Who Should Read

LLM 에이전트를 프로덕션에 배포하면서 GPT-4 비용이 부담스러워 SLM 대체를 고민하는 개발자. 소프트웨어 엔지니어링 자동화 파이프라인(버그 탐지, 코드 수정)을 운영 중인 팀.

Core Mechanics

Self-consistency(같은 쿼리를 여러 번 실행해 다수결로 답 결정하는 기법)로 실행 중인 에이전트가 결국 실패할지를 중간 시점에 예측해서 조기 종료하거나 더 강한 모델로 교체하는 ATROPOS 프레임워크를 제안.
에이전트 추론 경로를 SFG(Semantic Flow Graph)라는 그래프로 표현하고, GCN(Graph Convolutional Network, 그래프 구조를 학습하는 신경망)으로 현재 진행 중인 추론이 성공할지 실패할지 이진 분류.
Model Hotswap은 실패가 예측된 시점에 Llama-3-8B나 Mixtral 같은 SLM의 추론 컨텍스트를 GPT-4o 같은 강한 모델로 이어받아 계속 실행하는 방식. LLM 쿼리가 stateless(상태 없음)라 컨텍스트를 그냥 replay하면 돼서 구현 가능.
Parallel Hotswap(R개 추론을 동시에 k 스텝까지 SLM으로 실행 후 전환)과 Sequential Hotswap(R개 중 처음 k개를 완료 후 나머지를 강한 모델로 전환) 두 가지 전략을 지원.
AutoFL, AutoCodeRover, RepairAgent 3개의 소프트웨어 엔지니어링 에이전트에서 평가. 각각 Fault Localization(버그 위치 찾기)과 Automated Program Repair(자동 패치 생성) 태스크에 적용.
FastText 임베딩으로 에이전트 툴 호출 인자를 의미 기반으로 클러스터링해서 SFG를 구성함으로써, 구조적으로 다른 호출도 의미가 비슷하면 같은 노드로 묶어 일반화 가능성을 높임.

Evidence

AutoCodeRover(GPT-4 기준) 완전 궤적 예측 정확도 0.93, AUROC 0.93. 추론 중간 시점(k=8)에서도 정확도 0.85, AUROC 0.85 달성.
Parallel Hotswap 결과: AutoFL에서 GPT-4o 성능의 74.35%를 비용 23.90%만으로 달성. 실패 예측된 추론 중 최대 27.57%를 성공으로 전환.
Sequential Hotswap에서 AutoCodeRover k=1~2 구간은 Target(GPT-4) 단독 성능을 초과. Mixtral+GPT-4 앙상블 다양성 효과로 해석.
Ablation 결과: 시맨틱 임베딩 제거 시 AutoCodeRover 정확도 0.93→0.74로 하락. 함수 인자 정보 제거 시 0.93→0.60으로 더 큰 하락, 의미 정보가 예측력의 핵심임을 확인.

How to Apply

GPT-4 기반 에이전트를 운영 중이라면, 동일 태스크를 먼저 Llama-3나 Mixtral로 10개 샘플 실행하고 SFG를 구성해 GCN으로 성공 가능성을 예측한 뒤 실패 예측 케이스만 GPT-4로 hotswap하면 비용을 76% 절감하면서 성능 74% 유지가 가능.
Self-consistency를 쓰는 코드 생성/버그 수정 파이프라인에서 sequential 방식(k=1~5 완료 후 판단)을 적용하면 re-execution 없이 기존 궤적 조합만으로 hotswap 가능. k값을 조절해 비용-성능 트레이드오프 튜닝 가능.
에이전트가 ReAct 패턴(툴 호출 → 관찰 → 다음 행동)으로 구현되어 있다면 툴 호출 시퀀스를 SFG로 표현 가능. FastText로 툴 인자를 임베딩하고 GCN을 학습시키는 방식이라 새로운 에이전트에도 적용 확장 가능.

Code Example

snippet

# ATROPOS 핵심 흐름 의사코드
# 1. SLM으로 self-consistency 샘플 실행 (parallel)
agent_trajectories = []
for i in range(R):  # R=10 샘플
    traj = run_agent_on_slm(task, source_model='llama-3-8b', max_steps=N)
    agent_trajectories.append(traj)
    
    # k 스텝마다 조기 종료 예측
    if len(traj) == k:  # k = N//2 (중간 시점)
        sfg = build_sfg(agent_trajectories)  # 툴 호출 시퀀스 → 그래프
        prediction = gcn_model.predict(sfg)  # 성공 여부 예측
        
        if prediction == 'FAIL':
            # Hotswap: SLM 컨텍스트를 강한 모델로 이전
            context = extract_context(agent_trajectories)  # 기존 궤적 replay용
            remaining = run_agent_on_llm(
                task, 
                target_model='gpt-4o',
                context=context,  # 지금까지 k 스텝 컨텍스트 주입
                start_from_step=k
            )
            agent_trajectories[-1] = remaining

# 2. SFG 구성 (툴 호출 → 노드, 호출 순서 → 엣지)
def build_sfg(trajectories):
    nodes = {}  # unique reasoning steps
    edges = defaultdict(int)  # edge weight = frequency
    for traj in trajectories:
        for i, step in enumerate(traj[:-1]):
            # 노드: FastText(함수명 + 인자) 임베딩
            node_embed = concat(onehot(step.func), fasttext(step.args))
            node_id = cluster_or_assign(node_embed, nodes)  # 의미 기반 클러스터링
            next_id = cluster_or_assign(fasttext(traj[i+1].args), nodes)
            edges[(node_id, next_id)] += 1
    return GCNGraph(nodes, edges)

# 3. GCN 학습 (이진 분류: 성공=0, 실패=1)
# 학습 데이터: 완료된 궤적을 k 스텝에서 truncate해서 부분 궤적 생성
model = GCN(layers=3, hidden_dim=32, dropout=0.8)
model.train(truncated_sfgs, labels)  # 5-fold cross-validation

Terminology

Self-consistency같은 질문을 LLM에게 여러 번 던져서 가장 많이 나온 답을 채택하는 기법. 시험 문제를 10명한테 물어보고 7명이 같은 답 내면 그게 정답이라고 판단하는 것과 비슷.

GCNGraph Convolutional Network. 그래프 형태의 데이터(노드와 엣지)를 학습하는 신경망. 소셜 네트워크에서 친구 관계로 사람 특성 예측하듯, 툴 호출 패턴 그래프로 성공 여부 예측.

SFGSemantic Flow Graph. 에이전트가 어떤 툴을 어떤 순서로 호출했는지를 노드-엣지 그래프로 표현한 것. 여러 실행 경로의 공통 패턴이 더 굵은 엣지로 시각화됨.

Model Hotswap실행 도중 LLM을 교체하는 것. 싼 모델로 시작했다가 실패 조짐 보이면 비싼 모델로 지금까지의 대화 맥락을 넘겨서 이어받게 하는 방식.

ReActReasoning + Acting. LLM이 '생각 → 툴 호출 → 결과 관찰 → 다시 생각'을 반복하는 에이전트 패턴. 복잡한 문제를 여러 단계로 쪼개서 외부 도구(API, 코드 실행 등)를 써가며 해결.

AUROC모델이 성공/실패를 얼마나 잘 구분하는지 측정하는 지표. 0.5는 동전 던지기 수준, 1.0은 완벽. 0.85면 꽤 신뢰할 수 있는 수준.

Fault Localization버그가 코드의 어느 메서드/파일에 있는지 자동으로 찾아내는 태스크. 테스트 실패 정보와 코드를 보고 어디가 잘못됐는지 특정하는 것.

Related Resources

ATROPOS 소스코드 및 데이터셋 (anonymous)

Original Abstract (Expand)

Open-weight Small Language Models(SLMs) can provide faster local inference at lower financial cost, but may not achieve the same performance level as commercial Large Language Models (LLMs) that are orders of magnitudes larger. Consequently, many of the latest applications of LLMs, such as software engineering agents, tend to be evaluated on larger models only, leaving the issue of improving the cost-benefit trade-off of such applications neglected. This paper proposes Atropos, a predictive early-termination analysis and hotswap technique that aims to improve the cost-benefit trade-off for LLM-based agents that use self-consistency. The core component of ATROPOS is a predictive model based on structural properties of LLM inferences: after merging multiple agentic inference paths into a graph representation, ATROPOS uses Graph Convolutional Network (GCN) to predict whether an ongoing inference will eventually succeed or not. If an agentic task instance running on the source LLM is predicted to fail, ATROPOS subsequently performs hotswapping, i.e., migrating the on-going inference context onto the more capable target LLM: this is feasible because LLM contexts are stateless. An empirical evaluation of ATROPOS using three recent LLM-based agents shows that ATROPOS can predict early termination of eventually failing inferences with the accuracy of 0.85 at the midpoint of the inference. Hotswapping LLMs for such inferences can convert up to 27.57% of them to be successful. Consequently, ATROPOS achieves 74.35% of the performance of closed LLMs with as low as only 23.9% of the cost.