Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability | AI Paper Digest

TL;DR Highlight

ToolBench-X is a newly released benchmark that measures how well LLM agents hold up when, as in real deployments, APIs break or return anomalous results.

Who Should Read

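Backend developers and AI engineers who run production services built on LLM agents with external APIs and tools attached. It is especially useful for anyone working out how to handle runtime exceptions such as failed tool calls, changed response formats, and timeouts.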

Core Mechanics

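Existing benchmarks (BFCL, ToolBench, and others) assume that APIs always behave correctly, but in practice documentation goes stale, calls time out, and different tools return conflicting values.
ToolBench-X injects five hazard types: Specification Drift (the API spec changes), Invocation Error (arguments fail to be passed correctly), Execution Failure (timeouts and runtime errors), Output Drift (the response format changes), and Cross-source Conflict (results from multiple tools disagree).
The core design principle is recoverability: even after a hazard is injected, at least one valid path to the correct answer always remains, via retry, fallback, normalization, or cross-checking.
All 12 major models evaluated (including GPT-5.4, DeepSeek-V4-Pro, Claude-Sonnet-4.6, Doubao-Seed-2.0-Lite, and Qwen-3.5-35B) score below 60% success. Even the best performer, Doubao-Seed-2.0-Lite, reaches only 51.3%.
The main cause of failure is not the number of tool calls or the inference budget (compute) but a lack of hazard diagnosis ability. Supplying hazard hints raises success rates by roughly 25 to 35 points, whereas granting extra inference rounds alone yields only about 3 to 11 points.
Within the Qwen family, thinking mode (explicit reasoning) helps more than parameter scaling: Qwen-3.5-35B-Thinking scores 4.7 points above the non-thinking 35B model, while the non-thinking 35B actually scores below the smaller Qwen-3.0-30B.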
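The recoverability principle can be illustrated with a small sketch. This is not ToolBench-X code: `get_price`, `inject_execution_failure`, and `call_with_retry` are hypothetical names chosen for illustration. A deterministic tool is wrapped so that its first call times out while a later call still succeeds, which keeps retrying open as a valid recovery path.

```python
from typing import Any, Callable, Dict

Tool = Callable[..., Dict[str, Any]]


def get_price(item: str) -> Dict[str, Any]:
    """Deterministic clean tool (hypothetical)."""
    prices = {"apple": 100, "pear": 250}
    return {"item": item, "price": prices[item]}


def inject_execution_failure(tool: Tool, fail_first_n: int = 1) -> Tool:
    """Wrap a tool so that its first `fail_first_n` calls raise TimeoutError.

    Later calls reach the clean tool, so retrying remains a valid recovery path.
    """
    calls = {"count": 0}

    def hazardous_tool(**kwargs: Any) -> Dict[str, Any]:
        calls["count"] += 1
        if calls["count"] <= fail_first_n:
            raise TimeoutError("injected Execution Failure")
        return tool(**kwargs)

    return hazardous_tool


def call_with_retry(tool: Tool, max_retries: int = 2, **kwargs: Any) -> Dict[str, Any]:
    """Naive recovery: retry on timeout, re-raising after the last attempt."""
    for attempt in range(max_retries + 1):
        try:
            return tool(**kwargs)
        except TimeoutError:
            if attempt == max_retries:
                raise


flaky_get_price = inject_execution_failure(get_price, fail_first_n=1)
result = call_with_retry(flaky_get_price, item="apple")
print(result)  # {'item': 'apple', 'price': 100}
```

If `fail_first_n` exceeded the retry budget, the instance would no longer be recoverable by retrying alone, which is exactly the situation the recoverability principle rules out.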

Evidence

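All 12 models fall below 60% overall success. Doubao-Seed-2.0-Lite leads at 51.3%, followed by GPT-5.4 at 45.3%, DeepSeek-V4-Pro at 42.5%, and Claude-Sonnet-4.6 at 41.0%.
Providing hazard hints (Hint) improves success by 25.5 to 35.5 points over Baseline and recovers 60 to 80% of failed tasks. Providing extra inference rounds (Test-Time Scaling, TTS) improves it by only 3.5 to 11.5 points.
For GPT-5.4-Mini, the Hint condition reaches 84.0% versus 52.0% under TTS, so diagnostic information is worth 32 points more than additional compute.
Performance degrades with tool-chain length: Doubao-Seed-2.0-Lite scores 57.7% on tasks with three or fewer tools but drops to 46.2% on tasks with five or more. Gemini-3.1-Flash-Lite falls sharply from 44.2% to 31.8%.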
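The figures above are differences in percentage points, which are easy to confuse with relative changes. The short sketch below recomputes both from the success rates quoted in this section; the helper names are illustrative only and nothing here comes from the paper's code.

```python
# Success rates (%) quoted in the Evidence section:
# (tasks with <= 3 tools, tasks with >= 5 tools)
chain_length_results = {
    "Doubao-Seed-2.0-Lite": (57.7, 46.2),
    "Gemini-3.1-Flash-Lite": (44.2, 31.8),
}


def point_drop(short_rate: float, long_rate: float) -> float:
    """Absolute drop in percentage points."""
    return round(short_rate - long_rate, 1)


def relative_drop(short_rate: float, long_rate: float) -> float:
    """Drop expressed as a percentage of the short-chain success rate."""
    return round(100 * (short_rate - long_rate) / short_rate, 1)


for model, (short_rate, long_rate) in chain_length_results.items():
    print(f"{model}: {point_drop(short_rate, long_rate)} points, "
          f"{relative_drop(short_rate, long_rate)}% relative")

# Hint versus test-time scaling for GPT-5.4-Mini.
hint_vs_tts_gap = round(84.0 - 52.0, 1)
print(f"Hint minus TTS: {hint_vs_tts_gap} points")
```

Read this way, the Gemini-3.1-Flash-Lite drop is larger than the Doubao-Seed-2.0-Lite drop in both absolute and relative terms.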

How to Apply

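When a tool call fails in an agent system, adding a diagnosis layer that first classifies why it failed, rather than simply retrying, can raise the recovery rate substantially. For example: retry on a timeout, adjust the parsing logic on a format change, and cross-check when multiple sources conflict.
When an API response in a production agent does not match the expected schema, do not fail immediately; build Output Drift recovery strategies (field normalization, unit conversion, unpacking nested structures) into the pipeline.
When building your own agent benchmark, do not only evaluate whether the agent picked the correct tool, as existing benchmarks do. Inject hazards in the ToolBench-X manner and evaluate whether the final task was completed; only then does the benchmark measure production reliability properly.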
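The Output Drift recovery steps named above might look like the following sketch. Everything in it is an assumption made for illustration: the alias table, the wrapper keys, and the function names are hypothetical, and stripping a unit suffix is only a simple stand-in for real unit conversion.

```python
import re
from typing import Any, Dict

# Hypothetical alias table: drifted field name -> canonical field name.
FIELD_ALIASES = {"unit_price": "price", "cost": "price"}
# Hypothetical envelope keys that a drifted response may nest the payload under.
WRAPPER_KEYS = ("data", "result")


def unpack_nested(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Unwrap single-key envelopes such as {"data": {...}}."""
    while len(payload) == 1:
        key, value = next(iter(payload.items()))
        if key in WRAPPER_KEYS and isinstance(value, dict):
            payload = value
        else:
            break
    return payload


def normalize_fields(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Rename drifted field names to their canonical names."""
    return {FIELD_ALIASES.get(key, key): value for key, value in payload.items()}


def strip_unit(value: Any) -> Any:
    """Turn strings like '100.0 USD' into 100.0; leave other values unchanged."""
    if isinstance(value, str):
        match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*[A-Za-z%]*\s*", value)
        if match:
            return float(match.group(1))
    return value


def normalize_output(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Apply unpacking, field renaming, and unit stripping in that order."""
    payload = normalize_fields(unpack_nested(payload))
    if "price" in payload:
        payload["price"] = strip_unit(payload["price"])
    return payload


drifted = {"data": {"unit_price": "100.0 USD", "item": "apple"}}
print(normalize_output(drifted))  # {'price': 100.0, 'item': 'apple'}
```

Unit stripping is applied only to the known numeric field so that free-text fields which happen to start with digits are left alone.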

Code Example

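# Example of a ToolBench-X-style hazard diagnosis + recovery agent pattern.
# The helpers validate_output_schema, normalize_output, try_fallback_tool,
# remap_args_to_runtime_schema, and repair_invocation_args are placeholders
# that the caller must supply; they are not defined in this snippet.

def tool_call_with_recovery(tool_fn, args, max_retries=2):
    """
    Diagnose the hazard type on a tool call and pick a matching recovery strategy.

    Returns the tool result, the fallback tool's result after repeated
    timeouts, or None when every attempt is exhausted.
    Raises ValueError when the tool explicitly reports a failure.
    """
    for attempt in range(max_retries + 1):
        try:
            result = tool_fn(**args)

            # Output Drift detection: expected fields are missing or the format differs
            if not validate_output_schema(result):
                result = normalize_output(result)  # try normalization

            if result.get('success') is False or result.get('ok') is False:
                raise ValueError(f"Tool returned failure: {result}")

            return result

        except TimeoutError:
            # Execution Failure -> retry
            if attempt < max_retries:
                print(f"[Execution Failure] Timeout. Retrying {attempt + 1}/{max_retries}...")
                continue
            return try_fallback_tool(args)  # switch to the fallback tool

        except KeyError as e:
            # Specification Drift -> remap to the runtime schema, then loop again
            print(f"[Specification Drift] Missing field: {e}. Remapping...")
            args = remap_args_to_runtime_schema(args, e)

        except TypeError as e:
            # Invocation Error -> repair the argument structure, then loop again
            print(f"[Invocation Error] Argument error: {e}. Repairing args...")
            args = repair_invocation_args(args, e)

    return None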


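from collections import Counter


def cross_source_verify(results: list):
    """
    Cross-source Conflict: cross-check the results of multiple tools.
    Each result is normalized first (normalize_value is a placeholder the
    caller must supply), and a value is accepted only when a strict majority
    of all sources agrees on it; otherwise a ValueError is raised.
    """
    normalized = [normalize_value(r) for r in results if r is not None]
    if not normalized:
        raise ValueError("Cross-source conflict unresolved: no usable results")

    # Trust only a value that the sources agree on
    counts = Counter(normalized)
    most_common_val, count = counts.most_common(1)[0]

    if count >= len(results) // 2 + 1:  # strict majority of all sources
        return most_common_val
    # Conflict cannot be resolved -> request further verification
    raise ValueError(f"Cross-source conflict unresolved: {counts}")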


# Agent policy prompt (based on Table 13 of the paper)
POLICY_PROMPT = """
You are a tool-orchestration policy model.

Rules:
- If a tool fails, diagnose WHY before retrying:
  * TimeoutError -> retry once, then fallback
  * Missing field -> remap to observed runtime schema  
  * Argument error -> repair argument structure
  * Conflicting results -> cross-check all sources before finalizing
- Finish ONLY when all required evidence branches are resolved.
- Treat 'success=false', missing fields, and errors as UNRESOLVED.
- Do not mistake partial results for final answers.

Return JSON: {"action": "call_tool|retry|fallback|finish", "reason": "..."}
"""

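The policy prompt above asks the model to reply with a JSON object holding an `action` and a `reason`. A minimal sketch of validating such a reply before acting on it follows. It assumes the pipe-separated list in the prompt enumerates the only allowed actions; `parse_policy_decision` and `ALLOWED_ACTIONS` are hypothetical names that do not appear in the paper.

```python
import json
from typing import Dict

# Assumed to be the complete action set, taken from the prompt's "Return JSON" line.
ALLOWED_ACTIONS = {"call_tool", "retry", "fallback", "finish"}


def parse_policy_decision(raw: str) -> Dict[str, str]:
    """Parse and validate the policy model's JSON reply.

    Raises ValueError when the reply is not valid JSON, is not a JSON object,
    or names an action outside ALLOWED_ACTIONS.
    """
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Policy reply is not valid JSON: {exc}") from exc

    if not isinstance(decision, dict):
        raise ValueError("Policy reply must be a JSON object")

    action = decision.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"Unknown action: {action!r}")

    return {"action": action, "reason": str(decision.get("reason", ""))}


decision = parse_policy_decision('{"action": "retry", "reason": "timeout"}')
print(decision)  # {'action': 'retry', 'reason': 'timeout'}
```

Rejecting malformed replies here keeps a garbled policy output from being mistaken for a `finish` decision.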
Terminology

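Specification Drift: A state in which the API documentation and the actual behavior have diverged. For example, a field that used to be 'price' is one day renamed to 'unit_price': the documentation is unchanged, but the actual response schema is different.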

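Invocation Error: The correct tool was selected, but something went wrong while passing the arguments, such as middleware or an adapter converting a parameter incorrectly or dropping it.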

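Output Drift: The tool returned a response, but its format differs from what was expected. For example, a number expected as '100' arrives as '100.0 USD' with a unit attached, or wrapped inside a nested object.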

Cross-source Conflict: Several APIs called in parallel return different answers. Trusting the first result without cross-checking which source is right leads to wrong answers.

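Test-Time Scaling: A technique that extends inference time so the model can reason more. Here, the model is given extra rounds after a failure so it can retry on its own. The paper shows that increasing compute alone, without hints, has limited effect.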

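MDP: Short for Markov Decision Process. A mathematical model of a process in which an agent observes a state and chooses an action, after which the environment changes. The structure resembles a game character looking at the map and then moving.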

Exact-match Evaluation: A strict evaluation method that counts an output as correct only if it matches the reference string exactly. It treats '100' and '100.0' as different.

Fallback: An alternative tool or strategy used when the primary tool fails, such as switching to a secondary API when the primary API goes down.
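To make the Exact-match Evaluation entry concrete, the sketch below contrasts strict string equality with a lenient numeric comparison. The lenient variant is an assumption added for contrast, not a metric from the paper, and both function names are illustrative.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality, as in exact-match evaluation."""
    return prediction == reference


def numeric_match(prediction: str, reference: str, tol: float = 1e-9) -> bool:
    """Lenient alternative: compare as numbers when both strings parse as floats."""
    try:
        return abs(float(prediction) - float(reference)) <= tol
    except ValueError:
        # At least one side is not numeric, so fall back to string equality.
        return exact_match(prediction, reference)


print(exact_match("100", "100.0"))    # False
print(numeric_match("100", "100.0"))  # True
```

Under exact matching, an agent that recovers the right value but leaves it in a drifted format still scores zero, which is why normalization counts as part of recovery.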

Related Resources

ToolBench-X GitHub Repository: https://github.com/Foreverskyou/ToolBench-X

Original Abstract

Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across diverse domains and sequential, parallel, and mixed workflows, each paired with deterministic tools and a canonical final answer for automatic evaluation. Starting from clean tool environments, ToolBench-X injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Crucially, each injected instance remains solvable through at least one valid recovery path, such as retrying, fallback, verification, or cross-checking. Experiments reveal a substantial reliability gap: agents that perform well with reliable tools often fail under recoverable hazards. Further analysis shows that failures are driven less by tool-use volume or inference budget than by limited hazard diagnosis and ineffective recovery. Targeted recovery hints recover many failed tasks, while test-time scaling yields more limited gains. These results suggest that tool-use evaluation should move beyond function-call accuracy toward task completion under unreliable tool environments. The code and data is available at https://github.com/Foreverskyou/ToolBench-X.