Towards Verifiably Safe Tool Use for LLM Agents

Jan 12, 2026 • A. Doshi, Yining Hong, Congying Xu +3

TL;DR Highlight

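Proposes a design methodology that prevents sensitive-data leaks and incorrect actions in LLM agent tool calls by enforcing mathematically guaranteed rules instead of relying on probabilistic filters.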

Who Should Read

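Backend and AI developers who build agents with MCP or LangChain and worry about security incidents such as sensitive-data leaks or unsafe tool combinations. Senior developers and architects who need to design agent safety architectures for enterprise environments.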

Core Mechanics

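Most agent incidents come not from bugs in individual tools but from unexpected data flows that appear when several tools are combined. In the GitHub MCP vulnerability case, combining a file-read tool with a public-commit tool leaked information from a private repository.
ML-based guardrails such as GuardAgent, ShieldAgent, and TrustAgent reduce risk only probabilistically, so a persistent attacker needs just one attack tailored to the defense to get through.
STPA (a system safety analysis method used in aviation and autonomous driving) is applied to LLM agents to derive hazards up front, in the order stakeholders → losses → hazardous actions → safety requirements.
IFC (Information Flow Control, a technique for tracking and controlling where data flows from and to) is applied at MCP tool boundaries to block safety violations deterministically.
MCP is extended so that every tool must carry labels for capabilities (read/write/execute), confidentiality (public/sensitive), and trust_level. In current MCP this information is optional and untrusted.
A four-level enforcement structure, Blocklist (always block) / Mustlist (must execute) / Allowlist (auto-allow) / Confirmation (ask the user), lets designers tune the balance between safety and agent autonomy.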
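
The four enforcement levels can be sketched as a small policy object. This is a minimal illustration, not the authors' implementation: the `Policy` and `Decision` names, the precedence order (block, then confirm, then allow), the fallback to confirmation for unlisted flows, and the reading of Mustlist as "a trigger tool obliges a later tool call" are all assumptions, and `reschedule_event` / `notify_attendees` are hypothetical tool names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    BLOCK = "block"
    CONFIRM = "confirm"
    ALLOW = "allow"


@dataclass
class Policy:
    # Flows are (data_label, tool_capability) pairs.
    blocklist: set = field(default_factory=set)
    confirmation: set = field(default_factory=set)
    allowlist: set = field(default_factory=set)
    # Mustlist, modeled here as: trigger tool -> tool that must run later.
    mustlist: dict = field(default_factory=dict)

    def decide(self, data_label, tool_capability):
        flow = (data_label, tool_capability)
        if flow in self.blocklist:
            return Decision.BLOCK
        if flow in self.confirmation:
            return Decision.CONFIRM
        if flow in self.allowlist:
            return Decision.ALLOW
        # Assumption: unlisted flows fall back to asking the user.
        return Decision.CONFIRM

    def unmet_obligations(self, trace):
        """Return (trigger, required) pairs whose required tool never ran afterwards."""
        missing = []
        for i, tool in enumerate(trace):
            required = self.mustlist.get(tool)
            if required is not None and required not in trace[i + 1:]:
                missing.append((tool, required))
        return missing


EXAMPLE_POLICY = Policy(
    blocklist={("private", "external_write")},
    confirmation={("unsure", "external_write")},
    allowlist={("public", "external_write")},
    mustlist={"reschedule_event": "notify_attendees"},  # hypothetical tool names
)
```

Checking the blocklist first means a flow that is accidentally listed in two places resolves to the stricter outcome. Moving a flow between lists is how a designer would trade autonomy for safety without touching the agent itself.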

Evidence

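The augmented MCP framework is formally verified in Alloy (a formal modeling language based on first-order relational logic). Without policies, the analyzer immediately finds a counterexample that leaks private data; with policies applied, the check confirms that every unsafe flow is blocked.
Calendar agent example: the title of an appointment for STD treatment would appear verbatim in a rescheduling email sent to a colleague. The four-level enforcement structure blocks this deterministically by putting the list_events → send_email path on the blocklist or behind confirmation.
The safe trace (create event → reschedule → notify attendees, with private information excluded) is still allowed after the policies are applied, as confirmed by the Alloy Analyzer. This shows that the added safety does not break functionality.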
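
The idea behind the counterexample search can be imitated in plain Python by enumerating every tool sequence up to a fixed length. This is only an analogy to the Alloy check, not the paper's model: the `TOOLS` table, the whole-context taint rule (any private data in the context taints every later external write), and the helper names are assumptions made for illustration.

```python
from itertools import product

# Hypothetical tool model: labels each tool adds to the agent context,
# and whether the tool writes outside the system.
TOOLS = {
    "list_events": {"adds": {"private"}, "external_write": False},
    "create_event": {"adds": {"public"}, "external_write": False},
    "send_email": {"adds": set(), "external_write": True},
}


def run_trace(trace, enforce_policy):
    """Return (completed, leaked); completed is False if the policy blocked a call."""
    context = set()
    leaked = False
    for name in trace:
        tool = TOOLS[name]
        if tool["external_write"] and "private" in context:
            if enforce_policy:
                return False, leaked  # blocked before the write happens
            leaked = True
        context |= tool["adds"]
    return True, leaked


def find_counterexample(max_len, enforce_policy):
    """Search all traces up to max_len for one that leaks private data."""
    for length in range(1, max_len + 1):
        for trace in product(TOOLS, repeat=length):
            _, leaked = run_trace(trace, enforce_policy)
            if leaked:
                return trace
    return None
```

With `enforce_policy=False` the search returns a leaking trace, and with `enforce_policy=True` it returns `None` for the same bound, while a trace such as `("create_event", "send_email")` still completes. Like any bounded search, this says nothing about traces longer than `max_len`.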

How to Apply

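Add key-value tags of the form {'capabilities': 'external_write', 'confidentiality': 'private', 'trust_level': 'untrusted'} to each tool declaration on the MCP server. An external policy engine can then intercept tool calls at runtime and automatically block private → external_write flows.
Apply the four STPA steps when designing an agent workflow: ① identify direct and indirect stakeholders → ② derive each stakeholder's losses → ③ analyze the system behaviors that cause those losses → ④ define safety requirements, then choose the appropriate enforcement level from Blocklist/Mustlist/Allowlist/Confirmation.
Place interceptor middleware in front of external-write tools such as send_email and write_file, and implement a policy layer that blocks input data labeled 'private', asks the user for confirmation when the label is 'unsure', and allowlists data labeled 'public'.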
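
An interceptor like the one described above needs a confidentiality label for every input it inspects. One common IFC approach, sketched here under assumptions, is to carry a label with each value and take the most restrictive label whenever values are combined. The `Labeled` type, the ordering public < unsure < private, and the choice to label untracked data 'unsure' are illustrative and not taken from the paper.

```python
from dataclasses import dataclass

# Ordering assumption: public < unsure < private (most restrictive wins).
LEVELS = {"public": 0, "unsure": 1, "private": 2}


@dataclass(frozen=True)
class Labeled:
    value: str
    confidentiality: str


def join(*labels):
    """Return the most restrictive of the given labels ("public" if none are given)."""
    return max(labels, key=lambda label: LEVELS[label], default="public")


def concat(*parts):
    """Concatenate labeled strings; the result carries the join of all input labels."""
    text = "".join(part.value for part in parts)
    return Labeled(text, join(*(part.confidentiality for part in parts)))


def from_untracked(value):
    """Data with no known provenance is labeled "unsure" rather than "public"."""
    return Labeled(value, "unsure")
```

For example, `concat(Labeled("Rescheduled: ", "public"), Labeled("clinic appointment", "private"))` yields a value labeled 'private', so an email body built from a private calendar title would be blocked at the send_email boundary even though part of it is public text.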

Code Example

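# Example MCP tool declaration with capability-enhanced labels.
# The "labels" fields are the proposed extension, not part of current MCP.
TOOL_REGISTRY = {
    "send_email": {
        "name": "send_email",
        "description": "Send an email",
        "labels": {"capabilities": "external_write", "trust_level": "untrusted"},
        "inputSchema": {
            "to": {"type": "string"},
            "subject": {"type": "string", "labels": {"confidentiality": "public"}},
            # "inferred" presumably means the label is resolved at runtime from the data flowing in.
            "body": {"type": "string", "labels": {"confidentiality": "inferred"}},
        },
    }
}

class BlockedByPolicy(Exception):
    """Raised when a call would move private data into an external-write tool."""

def request_user_confirmation(tool_name, inputs):
    # Placeholder: a real system would prompt the user and run the tool on approval.
    return {"status": "needs_confirmation", "tool": tool_name, "inputs": inputs}

def execute_tool(tool_name, inputs):
    # Placeholder for the actual MCP tool invocation.
    return {"status": "executed", "tool": tool_name, "inputs": inputs}

# Policy-engine interceptor sketch.
def intercept_tool_call(tool_name, inputs, context_labels):
    tool_label = TOOL_REGISTRY[tool_name]["labels"].get("capabilities")
    if tool_label != "external_write":
        return execute_tool(tool_name, inputs)

    data_labels = {key: context_labels.get(key, {}).get("confidentiality") for key in inputs}

    # Check every input for private data first, so that an earlier "unsure"
    # input cannot return before a later private input has been seen.
    for key, label in data_labels.items():
        if label == "private":
            raise BlockedByPolicy(f"{key} is private, cannot send via {tool_name}")

    if "unsure" in data_labels.values():
        return request_user_confirmation(tool_name, inputs)

    return execute_tool(tool_name, inputs)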

Terminology

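STPA: System-Theoretic Process Analysis. A safety analysis method used for high-risk systems such as aircraft and autonomous vehicles. It looks for accidents that arise from interactions among components rather than from the failure of a single part.

IFC: Information Flow Control. A technique that tracks where data flows inside a system and prevents sensitive data from escaping to unauthorized destinations. It is loosely similar to how SQL injection defenses keep user input from reaching the query directly.

MCP: Model Context Protocol. A standard protocol created by Anthropic that unifies how LLM agents access external tools (APIs, databases, file systems, and so on). Like USB-C, it is a common interface that lets any tool plug in the same way.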

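Alloy: A formal modeling and analysis tool developed at MIT. You describe system behavior in mathematical logic, and the Alloy Analyzer automatically explores the possible states within a bounded scope to find counterexamples that break a safety rule. It is used to show that, within that scope, a given violation cannot occur.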

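Information Flow: The path along which data moves from one component of a system to another. In an agent, it means the output of tool A passing through the LLM context and into the input of tool B. If this flow is not controlled, sensitive information can leak to unexpected places.

Blocklist: A list of rules that unconditionally forbid specific actions. In the agent context, flows such as 'private data → external email' are blocked deterministically at the system level, without relying on the LLM's judgment.

Formal Verification: A method that uses mathematical logic to prove that a system always satisfies a given property. Testing shows 'it was safe in this case', whereas formal verification guarantees 'it is safe in all cases' covered by the model.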

Original Abstract

Large language model (LLM)-based AI agents extend LLM capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. While this empowers agents to perform complex tasks, LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records, which are unacceptable in enterprise contexts. Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety. Methods like information flow control (IFC) and temporal constraints aim to provide guarantees but often require extensive human annotation. We propose a process that starts with applying System-Theoretic Process Analysis (STPA) to identify hazards in agent workflows, derive safety requirements, and formalize them as enforceable specifications on data flows and tool sequences. To enable this, we introduce a capability-enhanced Model Context Protocol (MCP) framework that requires structured labels on capabilities, confidentiality, and trust level. Together, these contributions aim to shift LLM-based agent safety from ad hoc reliability fixes to proactive guardrails with formal guarantees, while reducing dependence on user confirmation and making autonomy a deliberate design choice.