Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
TL;DR Highlight
An agent optimization technique that achieves 74.35% of GPT-4o performance at only 23.9% of the cost by starting with an SLM and hotswapping to a stronger model such as GPT-4o when failure is predicted.
Who Should Read
Developers deploying LLM agents to production who want GPT-4-level performance but are concerned about its cost relative to SLMs. Teams operating software engineering automation pipelines (bug detection, code modification).
Core Mechanics
- Propose the ATROPOS framework, which predicts whether an agent currently running with Self-consistency (a technique for determining answers by running the same query multiple times and taking a majority vote) will ultimately fail and either terminates it early or switches to a more powerful model.
- Represent the agent's reasoning path as a graph called SFG (Semantic Flow Graph) and use a GCN (Graph Convolutional Network, a neural network that learns graph structures) to perform binary classification on whether the current reasoning will succeed or fail.
- Model Hotswap is a method of continuing execution by passing the inference context of SLMs such as Llama-3-8B or Mixtral to a more powerful model such as GPT-4o at the predicted failure point. This is possible because LLM queries are stateless, so the context can simply be replayed.
- Supports two strategies: Parallel Hotswap (running R inferences simultaneously up to k steps with SLM before switching) and Sequential Hotswap (completing the first k of R and then switching the rest to a more powerful model).
- Evaluated on three software engineering agents: AutoFL, AutoCodeRover, and RepairAgent. Applied to Fault Localization (finding bug locations) and Automated Program Repair (automatic patch generation) tasks.
- By clustering agent tool call arguments based on meaning using FastText embeddings to construct the SFG, structurally different calls with similar meanings can be grouped into the same node, increasing generalization capabilities.
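The meaning-based grouping of tool calls into SFG nodes can be sketched as threshold-based clustering over argument embeddings. The snippet below is a minimal illustration: `embed` is a toy bag-of-characters stand-in for FastText sentence embeddings, and the 0.9 threshold is an assumption for illustration, not a value from the paper.

```python
import math

def embed(text):
    # Toy bag-of-characters embedding standing in for FastText vectors.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def cluster_or_assign(call, centroids, threshold=0.9):
    """Map a tool call to an existing SFG node if its embedding is close
    enough to a node centroid; otherwise create a new node."""
    vec = embed(f"{call['func']} {call['args']}")
    for node_id, centroid in centroids.items():
        if cosine(vec, centroid) >= threshold:
            return node_id
    node_id = len(centroids)
    centroids[node_id] = vec
    return node_id
```

With this scheme, structurally different but semantically similar calls (e.g. `search_code("parseConfig")` vs. `search_code("parse_config")`) collapse into the same node, while an unrelated call opens a new node.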
Evidence
- AutoCodeRover (GPT-4-based) trajectory prediction reached accuracy 0.93 and AUROC 0.93 on complete trajectories; even at the intermediate step (k=8), it achieved accuracy 0.85 and AUROC 0.85.
- Parallel Hotswap results: 74.35% of GPT-4o performance achieved with only 23.90% of the cost in AutoFL. 27.57% of inferences predicted to fail were successfully converted.
- In Sequential Hotswap, AutoCodeRover with k=1~2 exceeded the performance of the target model (GPT-4) alone, interpreted as an ensemble-diversity effect of combining Mixtral and GPT-4.
- Ablation results: AutoCodeRover accuracy dropped from 0.93 to 0.74 when semantic embeddings were removed, and further to 0.60 when function-argument information was also removed, confirming that semantic information is key to predictive power.
How to Apply
- If you operate a GPT-4-based agent, first run 10 samples of the same task on Llama-3 or Mixtral, construct an SFG, predict the probability of success with a GCN, and hotswap to GPT-4 only for predicted failures; this maintains about 74% of performance while cutting costs by roughly 76%.
- If you are using Self-consistency in a code generation/bug fixing pipeline, apply the sequential method (k=1~5 completion then judgment). This allows hotswap without re-execution, using only existing trajectory combinations. Tune the k value for cost-performance trade-offs.
- If the agent is implemented with the ReAct pattern (tool call → observation → next action), the tool call sequence can be represented as an SFG. Because it involves embedding tool arguments with FastText and training a GCN, it can be applied and extended to new agents.
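To reason about when the trade-off above pays off, the expected cost of Parallel Hotswap can be modeled with simple arithmetic. The sketch below uses made-up per-step prices and a made-up predicted-failure rate; none of these numbers come from the paper.

```python
def parallel_hotswap_cost(R, k, N, cost_slm_step, cost_llm_step, fail_rate):
    """Expected cost: R SLM inferences run for k of N steps; the fraction
    predicted to fail is re-run on the target LLM for all N steps."""
    slm_cost = R * k * cost_slm_step
    llm_cost = fail_rate * R * N * cost_llm_step
    return slm_cost + llm_cost

# Illustrative numbers: R=10 samples, N=12 steps, midpoint check at k=6,
# SLM steps 50x cheaper than target-LLM steps, 30% predicted to fail.
baseline = 10 * 12 * 1.0  # cost of running all R samples on the target alone
hybrid = parallel_hotswap_cost(R=10, k=6, N=12, cost_slm_step=0.02,
                               cost_llm_step=1.0, fail_rate=0.3)
print(hybrid / baseline)  # fraction of the target-only cost
```

The ratio shrinks as the SLM gets cheaper or the predictor flags fewer false failures, which is why tuning k (earlier checks cost less but predict worse) drives the cost-performance trade-off.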
Code Example
# ATROPOS core flow pseudocode (helper functions are placeholders)
from collections import defaultdict

# 1. SLM self-consistency sampling (R parallel inferences)
agent_trajectories = []
for i in range(R):  # R = 10 samples
    traj = run_agent_on_slm(task, source_model='llama-3-8b', max_steps=N)
    agent_trajectories.append(traj)

    # Early-termination prediction at the intermediate step k
    if len(traj) == k:  # k = N // 2 (midpoint)
        sfg = build_sfg(agent_trajectories)  # tool-call sequences -> graph
        prediction = gcn_model.predict(sfg)  # predict success or failure
        if prediction == 'FAIL':
            # Hotswap: transfer the SLM context to a stronger model
            context = extract_context(agent_trajectories)  # replay existing trajectory
            remaining = run_agent_on_llm(
                task,
                target_model='gpt-4o',
                context=context,  # inject the first k steps
                start_from_step=k,
            )
            agent_trajectories[-1] = remaining

# 2. SFG construction (tool call -> node, call order -> edge)
def build_sfg(trajectories):
    nodes = {}                # unique reasoning steps
    edges = defaultdict(int)  # edge weight = frequency
    for traj in trajectories:
        for i, step in enumerate(traj[:-1]):
            # Node embedding: one-hot function name + FastText(arguments)
            node_embed = concat(onehot(step.func), fasttext(step.args))
            node_id = cluster_or_assign(node_embed, nodes)  # meaning-based clustering
            next_step = traj[i + 1]
            next_embed = concat(onehot(next_step.func), fasttext(next_step.args))
            next_id = cluster_or_assign(next_embed, nodes)
            edges[(node_id, next_id)] += 1
    return GCNGraph(nodes, edges)

# 3. GCN training (binary classification: success = 0, failure = 1)
# Training data: partial trajectories created by truncating completed
# trajectories at step k
model = GCN(layers=3, hidden_dim=32, dropout=0.8)
model.train(truncated_sfgs, labels)  # 5-fold cross-validation
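Once the surviving and hotswapped trajectories complete, self-consistency reduces them to a single answer by majority vote. A minimal sketch, assuming each trajectory ends in a hypothetical `final_answer` field:

```python
from collections import Counter

def majority_vote(trajectories):
    """Self-consistency aggregation: return the most frequent final answer
    across all completed trajectories."""
    answers = [t["final_answer"] for t in trajectories]
    return Counter(answers).most_common(1)[0][0]

# Example: 6 of 10 samples agree on one fault location.
trajs = ([{"final_answer": "Foo.java:42"}] * 6
         + [{"final_answer": "Bar.java:7"}] * 4)
print(majority_vote(trajs))  # -> Foo.java:42
```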
Original Abstract
Open-weight Small Language Models(SLMs) can provide faster local inference at lower financial cost, but may not achieve the same performance level as commercial Large Language Models (LLMs) that are orders of magnitudes larger. Consequently, many of the latest applications of LLMs, such as software engineering agents, tend to be evaluated on larger models only, leaving the issue of improving the cost-benefit trade-off of such applications neglected. This paper proposes Atropos, a predictive early-termination analysis and hotswap technique that aims to improve the cost-benefit trade-off for LLM-based agents that use self-consistency. The core component of ATROPOS is a predictive model based on structural properties of LLM inferences: after merging multiple agentic inference paths into a graph representation, ATROPOS uses Graph Convolutional Network (GCN) to predict whether an ongoing inference will eventually succeed or not. If an agentic task instance running on the source LLM is predicted to fail, ATROPOS subsequently performs hotswapping, i.e., migrating the on-going inference context onto the more capable target LLM: this is feasible because LLM contexts are stateless. An empirical evaluation of ATROPOS using three recent LLM-based agents shows that ATROPOS can predict early termination of eventually failing inferences with the accuracy of 0.85 at the midpoint of the inference. Hotswapping LLMs for such inferences can convert up to 27.57% of them to be successful. Consequently, ATROPOS achieves 74.35% of the performance of closed LLMs with as low as only 23.9% of the cost.