MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Jan 31, 2026 • Chaithanya Bandi, Ben Hertzberg, Geobio Boo, and 13 others

TL;DR Highlight

A benchmark that objectively measures the tool-use ability of LLM agents across 1,000 tasks, using 36 real MCP servers and 220 tools.

Who Should Read

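Backend and AI engineers who build MCP-based AI agents or need to evaluate the tool-calling performance of LLMs. Teams that need a baseline for comparing which models hold up in real multi-step workflows.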

Core Mechanics

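Claude Opus 4.5 ranks first at 62.3%, followed by Gemini 3 Pro at 54.1% and GPT-5 at 44.5%, while GPT-4o reaches only 7.2%, so the gap between models is extreme.
The leading failure category is Tool Usage (56.7% of failures), and its largest single mode is not calling a tool at all: the problem is less that models misuse tools than that they fail to recognize a tool is needed.
One third of the tasks require conditional branching (an if-else style tool-call flow), and most require multi-server orchestration spanning two or more servers.
The Financial and Coding domains have the highest tool-call failure rates, at 64-71%, while Analytics stands out for numerical calculation errors (Response Quality, 14%).
As model performance improves, the dominant failure cause clearly shifts later in the pipeline: tool selection failure → orchestration failure → final answer synthesis failure.
Scoring uses a claims-based rubric (a list of independently verifiable facts), which avoids the style bias of LLM-as-judge and reaches 78% agreement with human judgment.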
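
To make the claims-based rubric above concrete, here is a minimal sketch of partial-credit scoring. It assumes every claim carries equal weight and that a per-claim verdict is already available from some checker; the `ClaimResult` type, the function names, and the example claims are hypothetical, not taken from the benchmark harness. The summary lists 0.65, 0.75, and 0.85 as thresholds, and the default of 0.75 below is simply the middle value, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClaimResult:
    """One rubric claim and whether the final answer satisfied it."""
    claim: str
    satisfied: bool


def coverage_score(results: list[ClaimResult]) -> float:
    """Fraction of rubric claims satisfied (partial credit in [0, 1])."""
    if not results:
        raise ValueError("rubric must contain at least one claim")
    return sum(r.satisfied for r in results) / len(results)


def passes(results: list[ClaimResult], threshold: float = 0.75) -> bool:
    """A task passes when claim coverage meets the chosen threshold."""
    return coverage_score(results) >= threshold


# Hypothetical rubric for a single task: 3 of 4 claims satisfied.
example = [
    ClaimResult("States the correct ticker symbol", True),
    ClaimResult("Reports the closing price for the requested date", True),
    ClaimResult("Cites the source filing", True),
    ClaimResult("Computes the year-over-year change", False),
]
```

With this example, coverage is 3/4, so the task passes at the 0.65 and 0.75 thresholds and fails at 0.85. Scoring individual claims rather than the whole answer is what keeps answer length and style out of the score.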

Evidence

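Claude Opus 4.5 has a pass rate of 62.3%, 8.2 percentage points ahead of second-place Gemini 3 Pro at 54.1%; GPT-4o is near the bottom at 7.2%.
Tool Usage errors account for 56.7% of all failures, and within them "no tools called" averages 36.0%, making it the single largest failure mode.
Model rankings stay stable as the claims-based evaluation threshold varies (0.65/0.75/0.85), with a Spearman correlation coefficient ≥ 0.98.
Financial servers show syntax and type error rates of up to 45%, with an average error recovery rate of 60%.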
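
The ranking-stability check above can be reproduced in spirit with a few lines of Python. This is a sketch that assumes no tied scores, so it can use the closed-form Spearman formula. The model names and pass rates in the usage note are placeholders invented for illustration, not results from the paper.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models 1..n by descending score (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    if a.keys() != b.keys() or len(a) < 2:
        raise ValueError("need the same models (at least two) in both score sets")
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder pass rates under a lenient and a strict threshold (made-up values).
lenient = {"model_a": 0.70, "model_b": 0.61, "model_c": 0.52, "model_d": 0.30, "model_e": 0.12}
strict = {"model_a": 0.55, "model_b": 0.47, "model_c": 0.31, "model_d": 0.33, "model_e": 0.05}
rho = spearman(lenient, strict)
```

In the placeholder data, `model_c` and `model_d` swap places under the strict threshold, which gives a coefficient of 0.9. A value of 0.98 or higher, as reported above, means almost no reordering between thresholds.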

How to Apply

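Adding an instruction to the MCP agent prompt such as "first check the list of available tools, then explicitly select the tool needed for each subtask" can reduce "No tools called" failures.
For Financial and Coding agents, providing date formats, ticker symbols, and query syntax as few-shot examples, or attaching schema lookup (RAG), can address parameter errors (24-28%).
Analytics agents should not return tool results as-is; split the numerical calculation into its own step in the chain (using a code executor) to reduce final synthesis errors.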
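
One complementary way to catch the parameter errors mentioned above is to validate arguments before a call reaches the server, then feed any problems back to the model so it can retry. The sketch below assumes a hypothetical price-lookup tool with `ticker` and `date` arguments. The argument names, the ticker pattern, and the date format are illustrative assumptions, not part of any real MCP server schema.

```python
import re
from datetime import datetime

# Assumed ticker format: 1-5 uppercase letters, optional ".X" class suffix.
TICKER_PATTERN = re.compile(r"[A-Z]{1,5}(\.[A-Z])?")


def validate_price_query(args: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call looks well-formed."""
    problems = []

    ticker = args.get("ticker")
    if not isinstance(ticker, str) or not TICKER_PATTERN.fullmatch(ticker):
        problems.append("ticker must be 1-5 uppercase letters, e.g. 'AAPL'")

    try:
        # strptime raises TypeError for non-strings and ValueError for bad formats.
        datetime.strptime(args.get("date"), "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("date must be formatted as YYYY-MM-DD")

    return problems
```

For example, `validate_price_query({"ticker": "apple", "date": "01/31/2026"})` reports both fields, and the two messages can be returned to the model as a tool error instead of forwarding a malformed request.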

Code Example

# Example MCP agent system prompt (improving tool awareness)
SYSTEM_PROMPT = """
You are a tool-augmented assistant. Before answering any request:
1. List all available tools and identify which ones are relevant to the task.
2. Break the task into sub-goals and map each sub-goal to a specific tool.
3. If a tool returns no results, try alternative tools in the exposed set before concluding data is unavailable.
4. Do not stop until ALL sub-goals are addressed.

Available tools: {tool_list}
"""

Terminology

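MCP (Model Context Protocol): A protocol that lets AI models connect to external tools and servers in a standardized way. Like USB-C, it is a spec that lets any LLM plug in and use tools the same way.

Tool-Use: The ability of an LLM to call external APIs or functions directly to fetch information or perform work. In effect, the model gets to operate a search box, a calculator, a database query, and so on by itself.

claims-based rubric: A scoring method that breaks the correct answer down into a list of independently verifiable facts. Instead of grading an essay as a whole, it awards partial credit through a checklist of "is this fact correct?" items.

Distractor: A decoy tool that is not needed to solve the task but looks plausible enough to cause confusion. Distractors are mixed in to test whether the agent can pick the right tools, as it would have to in a real deployment.

LLM-as-judge: An approach in which another LLM evaluates answer quality. It is fast and easy to scale, but it can introduce style bias, such as scoring long answers generously.

Cross-server orchestration: Combining multiple MCP servers (for example, a search server, a database server, and a file server) to complete a single task. It resembles writing a report by pulling together information scattered across several departments.

POMDP (partially observable Markov decision process): A mathematical model of decision-making in which the agent does not know the full state of the environment. It gives a theoretical form to the way an MCP agent decides its next action while observing tool results.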

Original Abstract

The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.