MCP-RADAR: LLM의 Tool Use 능력을 평가하는 다차원 Benchmark

MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

May 22, 2025•Xuanqi Gao, Siyi Xie, Juan Zhai +2•View PDF

TL;DR Highlight

MCP(Model Context Protocol) 환경에서 GPT-5, Gemini, Claude 등 10개 LLM의 도구 사용 능력을 6개 도메인, 507개 태스크로 객관적으로 측정한 첫 번째 벤치마크.

Who Should Read

MCP 기반 AI 에이전트를 개발하거나 LLM의 tool calling 성능을 비교해야 하는 백엔드/AI 엔지니어. 어떤 모델이 이메일, 파일 관리, 터미널 등 실제 작업에서 얼마나 잘 동작하는지 벤치마크 데이터가 필요한 팀.

Core Mechanics

수학 추론에서는 closed-source 모델(GPT-5, Gemini-2.5-Pro)이 open-source를 크게 앞서지만, 웹 검색에서는 격차가 10% 미만으로 좁혀짐 — 전 모델 웹 검색 정확도 30% 미만
GPT-4o가 Filemanagement에서 DTSR(단계별 도구 호출 성공률) 84.5%를 달성했지만 최종 정확도는 43% — 문법적으로 도구를 호출하더라도 의미적으로 맞는 도구를 고르는 건 별개 문제
도구 개수 힌트를 줘도 성능 향상이 2.5~5%에 불과 — 모델의 진짜 병목은 '도구가 필요한지'가 아니라 '어떤 도구를, 어떤 파라미터로' 쓸지임
오픈소스 중에서는 Qwen3-235B가 정확도와 토큰 효율(CRE) 균형이 가장 좋고, Gemini-2.5-Pro가 전 도메인에서 가장 안정적인 closed-source 모델
가장 흔한 실패 유형: Fuzzy Match에서 Parameter Error(34.6%)와 Inaccurate Tool Invocation(43.2%) — 파라미터를 잘못 채우거나 용도가 비슷한 엉뚱한 도구를 고르는 케이스
대화 라운드 K=10이 성능-비용 스위트스팟 — K≥10 이후 정확도 향상이 포화되기 시작

Evidence

Gemini-2.5-Pro가 웹 검색 정확도 29.8%로 1위, Llama-4는 0.8%로 꼴찌 — 오픈소스 평균 10.8% vs 클로즈드 평균 20.7%
GPT-4o Filemanagement: DTSR 84.5% vs 최종 ACC 43% — 40.7%p 갭으로 도구 실행 능력과 문제 해결 능력의 괴리를 수치로 증명
수학 도메인 최고: Gemini-2.5-Flash ACC 0.612 / 최저: Llama-4 ACC 0.128 — 약 5배 차이
도구 개수 힌트(2/2 조건) 제공 시 평균 정확도 개선폭이 2.5~5%에 불과 (Table 4 기준)

How to Apply

MCP 도구를 설계할 때는 'atomic tool' 원칙을 적용하라 — 하나의 도구가 한 가지 기능만 하도록 쪼개면 LLM이 파라미터 오류 없이 조합해서 사용함 (복잡한 멀티기능 도구보다 단순 도구 여러 개가 정확도 높음)
MCP 시스템 프롬프트 작성 시 도구 설명을 concise하게 유지하라 — ReAct vs Concise 실험에서 Gemini-2.5-Pro는 Concise 프롬프트가 +10.2%p 높았음. verbose한 설명이 오히려 LLM의 인지 부하를 높임
에이전트 파이프라인에서 최대 대화 라운드를 K=10으로 설정하는 것이 실용적 — K=5 대비 K=10에서 Gemini-2.5-Pro 기준 0.365→0.614로 정확도 향상, K=15에서도 0.622로 수렴

Code Example

snippet

# MCP-RADAR 스타일의 도구 정의 예시 (atomic tool 원칙 적용)
# 나쁜 예: 하나의 도구에 너무 많은 기능
bad_tool = {
    "name": "EmailManager",
    "description": "Send, read, draft, delete, label emails and manage attachments",
    "inputs": ["action", "to", "subject", "body", "labels", "attachments", ...]
}

# 좋은 예: atomic하게 분리
good_tools = [
    {
        "name": "SendEmail",
        "description": "Send a single email to one or more recipients.",
        "inputs": ["to", "subject", "body"]
    },
    {
        "name": "DraftEmail",
        "description": "Save an email as draft without sending.",
        "inputs": ["to", "subject", "body"]
    },
    {
        "name": "LabelEmail",
        "description": "Add or remove a label from an existing email by message_id.",
        "inputs": ["message_id", "label", "action"]
    }
]

# Concise 시스템 프롬프트 예시 (ReAct보다 성능 좋은 케이스 多)
system_prompt = """
You are a helpful assistant with access to MCP tools.
Rules:
- ALWAYS use tools to complete tasks. Do NOT answer from memory.
- Select the most specific tool for the task.
- Format your final answer as: <answer>[YOUR ANSWER]</answer>
"""

# 평가 메트릭 계산 예시
def compute_cre(token_used, token_min, token_max):
    """Computational Resource Efficiency (낮을수록 토큰 효율적)"""
    return (token_used - token_min) / (token_max - token_min + 1e-9)

def check_fuzzy_match(pred_tool, pred_args, gt_tool, gt_args):
    """Fuzzy Match 정확도: 도구 이름 + 핵심 파라미터 모두 맞아야 성공"""
    return pred_tool == gt_tool and pred_args == gt_args

Terminology

MCP (Model Context Protocol)Anthropic이 만든 LLM과 외부 도구를 연결하는 표준 규격. USB-C 포트처럼, 어떤 LLM이든 MCP를 지원하면 동일한 방식으로 도구(검색, 이메일, 파일 등)를 호출할 수 있음.

DTSR (Dialogue Turn Success Rate)대화 한 턴 한 턴마다 도구를 얼마나 정확하게 호출했는지의 비율. 최종 정답과 별개로, 중간 과정이 올바른지 측정하는 지표.

CRE (Computational Resource Efficiency)정답을 맞추는 데 토큰(처리 비용)을 얼마나 효율적으로 썼는지. 같은 정확도라면 토큰을 덜 쓴 모델이 CRE 점수가 높음.

Fuzzy Match정답이 딱 하나의 텍스트가 아니라, '올바른 도구를 올바른 파라미터로 호출했는지'로 성공을 판단하는 평가 방식. 이메일 전송, 파일 관리 같은 작업에 사용.

Precise Answer수학 계산 결과나 웹 검색 정답처럼 정확히 하나의 정답이 있는 태스크 유형.

Atomic Tool딱 하나의 기능만 수행하는 작은 도구. 여러 기능을 합친 복잡한 도구보다 LLM이 정확하게 사용하기 쉬움. 레고 블록처럼 단순한 조각을 조합해서 복잡한 작업을 수행.

ReActThought(생각) → Action(행동) → Observation(결과 확인)을 반복하며 문제를 푸는 LLM 프롬프트 패턴. 마치 요리사가 '재료 확인 → 조리 → 맛 확인'을 반복하는 것과 유사.

Tool HallucinationLLM이 실제로 존재하지 않는 도구를 호출하거나, 도구 없이 내부 지식만으로 답을 지어내는 현상.

Related Resources

Original Abstract (Expand)

As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of interacting with external tools, the Model Context Protocol (MCP) has emerged as a key standardized framework for dynamic tool discovery and orchestration. Despite its widespread industry adoption, existing evaluation methods do not adequately assess tool utilization capabilities under this new paradigm. To address this gap, this paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance within the MCP framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. It quantifies performance based on two primary criteria: answer correctness and operational accuracy. To closely emulate real-world usage, our evaluation employs both authentic MCP tools and high-fidelity simulations of official tools. Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR adopts objective, quantifiable measurements across multiple task domains, including computational resource efficiency and the number of successful tool-invocation rounds. Our evaluation of leading closed-source and open-source LLMs reveals distinct capability profiles and highlights a significant trade-off between accuracy and efficiency. Our findings provide actionable insights for both LLM developers and tool creators, establishing a standardized methodology applicable to the broader LLM agent ecosystem. All implementations, configurations, and datasets are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.