HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents | AI Paper Digest

TL;DR Highlight

A technique that bundles multiple MCP tool calls into a single code block, addressing both context waste and fragmented reasoning in LLM agents.

Who Should Read

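AI agent developers building multi-tool agents on MCP (Model Context Protocol) who are running into context-length limits or degraded performance caused by too many intermediate results. Also useful for backend developers designing complex workflows in which an LLM must call several APIs in sequence.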

Core Mechanics

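Conventional ReAct writes every tool-call result directly into the main reasoning trace, so the context grows rapidly, a problem referred to here as 'context inflation'. For example, computing the distance between two places exposes two geocoding calls and one distance calculation in the trace.
HyperTool bundles multiple tool calls into a single executable code block, handles intermediate results inside the block, and returns only the final result to the main trace. Existing MCP tool schemas are kept unchanged and are invoked through a call_tool() wrapper function.
HyperTool blocks fall into five types: Atomic (a simple wrapper), Chaining (a sequential tool chain), Transform (parsing or filtering results), Aggregate (calling many tools in a loop and aggregating the results), and Helper (defining reusable internal functions).
Training data is synthesized in a three-stage pipeline: (1) generate problems that genuinely require composing multiple tools, (2) roll out HyperTool trajectories in real MCP environments, including context compression and local repair of failed blocks, and (3) verify them by filtering on execution correctness and evidence consistency.
The training set consists of 10,422 verified HyperTool trajectories; GLM-5.1 generates the trajectories and GPT-4o filters them as the judge.
A single unified interface (HyperTool-only) outperforms an Atomic+HyperTool hybrid, because dynamically switching between two interfaces itself adds cognitive load for small models.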
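
The mechanism described above can be sketched as a small executor. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_hypertool`, `TOOL_REGISTRY`, and the stub tools `geocode` and `distance` are hypothetical names, the registry stands in for real MCP servers, and plain `exec` is used without any sandboxing.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical in-process registry standing in for real MCP servers.
# Keys are (server_name, tool_name); values are plain Python callables.
TOOL_REGISTRY: Dict[Tuple[str, str], Callable[[dict], Any]] = {
    ("maps", "geocode"): lambda p: {"lat": float(len(p["address"])), "lng": 0.0},
    ("maps", "distance"): lambda p: {"km": abs(p["a"]["lat"] - p["b"]["lat"])},
}


def call_tool(server_name: str, tool_name: str, params: dict) -> Any:
    """Dispatch to an existing tool using its original parameter schema."""
    return TOOL_REGISTRY[(server_name, tool_name)](params)


def run_hypertool(code: str) -> Any:
    """Run one HyperTool block and return only the value bound to `result`.

    Intermediate variables live in a throwaway namespace, so they never
    reach the caller (the main reasoning trace) and do not persist
    between blocks. NOTE: plain exec() is NOT a sandbox.
    """
    namespace: Dict[str, Any] = {"call_tool": call_tool}
    exec(code, namespace)
    if "result" not in namespace:
        raise ValueError("HyperTool block must assign a variable named 'result'")
    return namespace["result"]


# Two geocoding calls and one distance call folded into a single outer call.
block = """
a = call_tool("maps", "geocode", {"address": "Alpha"})
b = call_tool("maps", "geocode", {"address": "Betaville"})
result = call_tool("maps", "distance", {"a": a, "b": b})
"""

observation = run_hypertool(block)
print(observation)
```

Only `observation` would be appended to the agent's trace; the two geocoding results stay inside the block's namespace and are discarded when it returns.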

Evidence

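With Qwen3-8B, average accuracy rises from 9.93% to 33.33%; with Qwen3-32B, from 15.69% to 35.29%. The 8B model surpasses GPT-OSS (32.13%) and Gemini-2.5-Flash (25.58%).
In the Financial Analysis domain, HyperTool nearly doubles accuracy over ReAct-SFT (32.5% → 62.5%) while cutting token consumption by 78% (916k → 199k).
HyperTool actually uses more tools (47.55) than ReAct-SFT (26.92), yet the number of tool calls exposed to the model (26.92 → 20.76) and the total interaction turns (28.81 → 21.76) both decrease.
Removing Execution Filtering or Evidence Filtering drops the final average accuracy from 33.33% to 21.05% and 18.06%, respectively, which demonstrates the importance of strict data verification.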
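
The ratios quoted above can be re-derived from the reported figures. The script below is only an arithmetic check on those numbers; `relative_change` is a hypothetical helper name.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Financial Analysis, ReAct-SFT vs. HyperTool (figures as reported above).
token_change = relative_change(916_000, 199_000)
accuracy_ratio = 62.5 / 32.5

# Averages across tasks (figures as reported above).
exposed_call_change = relative_change(26.92, 20.76)
turn_change = relative_change(28.81, 21.76)

print(f"token change:        {token_change:.1f}%")
print(f"accuracy ratio:      {accuracy_ratio:.2f}x")
print(f"exposed tool calls:  {exposed_call_change:.1f}%")
print(f"interaction turns:   {turn_change:.1f}%")
```

The accuracy ratio comes out slightly under 2, which is why the gain is better described as "nearly doubles" than as an exact doubling.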

How to Apply

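To add HyperTool to an existing MCP agent, list all existing tools in the system prompt and expose HyperTool as the only tool the agent can call directly. Inside a block, the existing tools are called as call_tool(server_name, tool_name, params).
It is especially effective for workflows heavy on data aggregation or filtering, such as Financial Analysis, and for complex search-compare-rank workflows. For example, implementing the logic that looks up several candidates in a loop, scores them, and selects the best one as a single HyperTool block keeps intermediate results from piling up in the trace.
When building your own HyperTool SFT data, synthesize trajectories with a teacher model such as GLM-5.1 and always apply two-stage filtering in which an LLM judge verifies both execution correctness and evidence consistency. Omitting either one significantly degrades SFT performance.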
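
The two-stage filter described above could be wired up roughly as follows. This is a sketch under assumptions: the `Trajectory` fields and both check functions are hypothetical placeholders, and in a real pipeline each check would be a call to an LLM judge rather than the toy string rules used here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    """Hypothetical record of one synthesized HyperTool rollout."""
    task: str
    final_answer: str
    expected_answer: str
    tool_outputs: List[str]  # raw observations returned by HyperTool blocks


# A check takes a trajectory and returns True if it passes.
Check = Callable[[Trajectory], bool]


def execution_correct(t: Trajectory) -> bool:
    """Toy stand-in for the execution-correctness judge."""
    return t.final_answer.strip() == t.expected_answer.strip()


def evidence_consistent(t: Trajectory) -> bool:
    """Toy stand-in for the evidence-consistency judge:
    the final answer must appear in at least one tool output."""
    return any(t.final_answer in out for out in t.tool_outputs)


def filter_trajectories(items: Iterable[Trajectory], checks: List[Check]) -> List[Trajectory]:
    """Keep only trajectories that pass every check."""
    return [t for t in items if all(check(t) for check in checks)]


candidates = [
    Trajectory("q1", "42", "42", ["value=42"]),  # passes both checks
    Trajectory("q2", "42", "41", ["value=42"]),  # wrong final answer
    Trajectory("q3", "7", "7", ["value=9"]),     # right answer, no supporting evidence
]
kept = filter_trajectories(candidates, [execution_correct, evidence_consistent])
print([t.task for t in kept])
```

Keeping the checks as a list makes it easy to ablate one of them and measure the effect on downstream SFT accuracy.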

Code Example

# Core part of the system message for the HyperTool agent prompt
system_prompt = """
You are a helpful assistant.

HyperTool Strategy:
- When the workflow is deterministic and predictable, use a SINGLE HyperTool code block.
- When next step depends heavily on interpreting semantics of previous outputs, call tools one by one.

Execution Rules:
1. The ONLY tool you can call directly is HyperTool.
2. Inside HyperTool, use: call_tool(server_name, tool_name, {"param": "value"})
3. Always assign final result to variable named 'result'.
4. Do NOT use print statements. Do NOT add comments (#) in the code block.
5. Variables from different HyperTool blocks are NOT reusable.
"""

# Example HyperTool block: rank cafes by how closely the driving time from one
# address matches the walking time from the other
hypertool_block_example = """
j_gateway_addr = "J Gateway Condo, Singapore"
park_place_addr = "Park Place Residences at PLQ, Singapore"
cafe_addresses = ["Cafe A, Singapore", "Cafe B, Singapore"]
cafe_names = ["Cafe A", "Cafe B"]
cafe_place_ids = ["id_a", "id_b"]

driving_result = call_tool("google-maps", "maps_distance_matrix", {
    "origins": [j_gateway_addr],
    "destinations": cafe_addresses,
    "mode": "driving"
})
walking_result = call_tool("google-maps", "maps_distance_matrix", {
    "origins": [park_place_addr],
    "destinations": cafe_addresses,
    "mode": "walking"
})

results_list = []
for i in range(len(cafe_names)):
    d = driving_result["result"]["results"][0]["elements"][i]["duration"]["value"]
    w = walking_result["result"]["results"][0]["elements"][i]["duration"]["value"]
    results_list.append({"name": cafe_names[i], "place_id": cafe_place_ids[i], "diff": abs(d - w)})

results_list.sort(key=lambda x: x["diff"])
result = results_list
"""
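
The Execution Rules in the prompt above can also be checked mechanically before a block is run. The validator below is a sketch, not part of HyperTool itself: `validate_block` is a hypothetical helper built on Python's standard `ast` and `tokenize` modules, and it covers only rule 3 (assign `result`) and rule 4 (no print statements, no comments).

```python
import ast
import io
import tokenize
from typing import List


def validate_block(code: str) -> List[str]:
    """Return a list of rule violations for one HyperTool block (empty if none)."""
    problems: List[str] = []
    tree = ast.parse(code)

    # Rule 3: some statement must bind a variable named 'result'.
    assigns_result = any(
        isinstance(node, ast.Name)
        and node.id == "result"
        and isinstance(node.ctx, ast.Store)
        for node in ast.walk(tree)
    )
    if not assigns_result:
        problems.append("no assignment to 'result'")

    # Rule 4a: no print(...) calls.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            problems.append(f"print call on line {node.lineno}")

    # Rule 4b: no '#' comments. The tokenizer reports comments; the AST drops them.
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.COMMENT:
            problems.append(f"comment on line {tok.start[0]}")

    return problems


# A '#' inside a string literal is not a comment and must not be flagged.
good = 'x = call_tool("s", "t", {"q": "a#b"})\nresult = x\n'
bad = 'x = 1  # scratch\nprint(x)\n'

print(validate_block(good))
print(validate_block(bad))
```

Rejecting a block before execution gives the model a short, specific error to repair, instead of a failed tool call whose cause it has to infer.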

Terminology

MCP (Model Context Protocol): An interface convention that lets an LLM call external tools and APIs in a standardized way. It defines tool names, input schemas, and the call format so that an agent can use any tool the same way.

ReAct: An agent pattern in which the LLM alternates between Reasoning and Acting (tool calls). It repeats a loop of thinking, calling a tool, reading the result, and thinking again.

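SFT (Supervised Fine-Tuning): A method of further training a model by showing it exemplary answers and having it imitate them. It resembles studying worked examples at school and then solving problems the same way.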

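context inflation: The phenomenon in which the more tools an agent uses, the more intermediate results accumulate in the conversation context until the context window fills up. It is like a long receipt on which every item remains, including the ones that no longer matter.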

Trajectory: The full record of how an agent solved a problem: the sequence of what it thought, which tools it called, and what results it received.

RCE (Remote Code Execution): A security vulnerability that lets malicious input execute arbitrary code on a server. Systems that execute code dynamically, as HyperTool does, must be isolated in a sandbox.

CodeAct: A framework that expresses an LLM agent's actions as code. Instead of issuing tool calls, the agent generates Python code directly and executes it.

Original Abstract

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpass GPT-OSS and Kimi-k2.5 on average accuracy, showing that our HyperTool can substantially improve multi-step tool use.