ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
TL;DR Highlight
A rule-based runtime security layer that blocks malicious commands whenever an LLM agent receives results from external tools.
Who Should Read
Developers deploying LLM agents using tools like AutoGPT, LangChain, and MCP. Teams operating pipelines where agents perform web searches, read files, and call external APIs.
Core Mechanics
- The main pathways for Indirect Prompt Injection (attacks that manipulate agents by hiding malicious commands in external content) are web/local content insertion, MCP server insertion, and skill file insertion.
- Existing defenses (RLHF alignment, StruQ protocol separation, CaMeL dual LLM) all require fine-tuning, infrastructure changes, or manual rule creation by experts, making them difficult to apply in practice.
- ClawGuard automatically generates access rules (Rtask) from the user's task goal and obtains user confirmation before the agent makes its first tool call — establishing the rules before any malicious content can be mixed in is the key idea.
- Four components operate at every tool call boundary: Content Sanitizer (masking sensitive information), Rule Evaluator (whitelist/blacklist judgment), Skill Inspector (skill file risk assessment), and Approval Mechanism (user approval for ambiguous cases).
- If the Rule Evaluator returns an 'amb' (ambiguous) verdict — matching neither whitelist nor blacklist — it requests user approval; the verdict is also automatically escalated to 'amb' when obfuscation patterns such as Base64 encoding are detected.
- As a middleware approach that requires no model or infrastructure modification, it can be attached to any LLM backbone; the evaluation covers DeepSeek-V3.2, GLM-5, Kimi-K2.5, MiniMax-M2.5, and Qwen3.5-397B.
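The obfuscation escalation described above can be sketched with a simple heuristic. This is an illustrative sketch only — the regex threshold, the `looks_like_base64` helper, and the escalation policy are assumptions, not ClawGuard's published detection logic:

```python
import base64
import re

# Heuristic: long runs of Base64-alphabet characters that also decode cleanly.
# The 24-character threshold is an illustrative assumption.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def looks_like_base64(text):
    """Return True if text contains a plausible Base64-encoded payload."""
    for candidate in BASE64_RUN.findall(text):
        # Re-pad to a multiple of 4 before attempting strict decoding
        padded = candidate + "=" * (-len(candidate) % 4)
        try:
            base64.b64decode(padded, validate=True)
            return True
        except ValueError:
            continue
    return False

def escalate_verdict(verdict, tool_args):
    """Force user approval ('amb') when an allowed call carries obfuscated content."""
    if verdict == "allow" and looks_like_base64(tool_args):
        return "amb"
    return verdict
```

In practice such a heuristic would sit inside the Rule Evaluator, so that an obfuscated payload never reaches the tool with a clean 'allow' verdict.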
Evidence
- In the AgentDojo benchmark (160 tasks), the base model's ASR (attack success rate) was 0.6–3.1%, while ClawGuard achieved 0% ASR and 100% DSR (defense success rate) across all models.
- In the SkillInject benchmark (84 skill injection attacks), the base model ASR decreased from 26–48% to 4.8–14% with ClawGuard applied, a relative reduction of 50–84%. Task completion rate (CR) remained at 85–90%.
- In the MCPSafeBench benchmark (215 real MCP server attacks), the base model ASR decreased from 36.5–44.5% to 7.1–11% with ClawGuard applied, achieving a DSR of 74.9–75.8%.
- For context, the AgentSafetyBench study found that none of the 16 popular LLM agents it evaluated achieved a safety score above 60%; these results support the need for rule-based boundary enforcement.
How to Apply
- In a LangChain or AutoGPT pipeline, insert ClawGuard's Rule Evaluator as middleware immediately before tool calls — it can be implemented as an interceptor pattern that checks if the command is on the whitelist before the tool is executed.
- At the start of an agent session, insert the user's task description (e.g., 'Summarize the three most recent blog posts from example-research.org and save to ~/reports/') into the Rule Synthesis Prompt in Figure 4 to automatically generate JSON rules, obtain user confirmation, and apply them throughout the session.
- Hardcoding the Rbase basic rules from Table V (~/.ssh/ access blocking, rm -rf blocking, ngrok and other tunneling service blocking, etc.) into the tool call validation logic of your own agent can establish a minimum security baseline without fine-tuning.
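The Rbase baseline check could be hardcoded roughly as follows. This is a minimal sketch — the specific deny-listed paths, commands, and tunneling domains are illustrative stand-ins for Table V, and `violates_rbase` is a hypothetical helper name:

```python
# Illustrative stand-ins for Table V's Rbase baseline (not the paper's exact list)
RBASE_DENY_PATHS = ("~/.ssh/", "~/.aws/", "/etc/shadow")
RBASE_DENY_COMMANDS = ("rm -rf", "mkfs", "dd if=")
RBASE_DENY_DOMAINS = ("ngrok.io", "ngrok-free.app")

def violates_rbase(tool_name, argument):
    """Return True if a tool call hits the hardcoded security baseline."""
    arg = argument.strip()
    if any(path in arg for path in RBASE_DENY_PATHS):
        return True  # touches a blocked filesystem location
    if tool_name == "shell" and any(cmd in arg for cmd in RBASE_DENY_COMMANDS):
        return True  # destructive shell command
    if any(domain in arg for domain in RBASE_DENY_DOMAINS):
        return True  # tunneling service usable for exfiltration
    return False
```

Because these checks are plain string predicates, they run deterministically on every tool call and cost essentially nothing compared to the LLM invocation itself.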
Code Example
# ClawGuard Rule Synthesis Prompt example (based on Figure 4)
# Automatically generate rules from user task before agent session starts
SYSTEM_PROMPT = """
You are a security policy synthesizer for an LLM agent runtime.
Given the user's task description, produce a minimal, precise rule set
in valid JSON that restricts the agent to actions necessary for the stated task.
Do not infer permissions not required by the task.
Output only the JSON object; no prose.
"""
USER_TASK = "Summarize the three most recent blog posts from example-research.org and save to ~/reports/summary.md"
TASK_PROMPT = f"""
Based solely on the task described below, produce a JSON object with:
- network_rules: {{whitelist: [...], blacklist: [...]}}
- file_rules: {{whitelist: [...], blacklist: [...]}}
- command_rules: {{
    framework_tools: {{allow: [...], deny: [...]}},
    shell_commands: {{allow: [...], deny: [...]}},
    queue: [...]
  }}
Task: {USER_TASK}
Apply the principle of least privilege.
"""
# Expected output example
expected_rules = {
    "network_rules": {
        "whitelist": ["example-research.org"],
        "blacklist": ["*.onion", "*.ngrok.io"]
    },
    "file_rules": {
        "whitelist": ["~/reports/"],
        "blacklist": ["~/.ssh/", "~/.aws/", "/etc/"]
    },
    "command_rules": {
        "framework_tools": {
            "allow": ["web_fetch", "write"],
            "deny": ["exec", "read"]
        },
        "shell_commands": {
            "allow": [],
            "deny": ["rm", "curl", "wget", "bash"]
        },
        "queue": ["file_deletion", "network_write"]
    }
}
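The sanitize_sensitive_data helper invoked by the interceptor below is not spelled out in the paper's figure; a minimal sketch might look like this, where the specific regex patterns are illustrative assumptions:

```python
import re

# Illustrative credential patterns (assumptions, not ClawGuard's exact list)
SENSITIVE_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[MASKED_AWS_KEY]"),          # AWS access key ID
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?"
                r"-----END [A-Z ]*PRIVATE KEY-----"),
     "[MASKED_PRIVATE_KEY]"),                                        # PEM private keys
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[MASKED_GITHUB_TOKEN]"),   # GitHub PAT
]

def sanitize_sensitive_data(text):
    """Mask credential-like substrings before they reach the rule evaluator or logs."""
    for pattern, replacement in SENSITIVE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```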
# Tool call interceptor pattern (pseudo-code)
def intercept_tool_call(tool_name, args, rules):
    # 1. Content Sanitizer: mask AWS keys, SSH keys, etc.
    sanitized_args = sanitize_sensitive_data(args)
    # 2. Rule Evaluator: verdict is 'allow' | 'deny' | 'amb'
    verdict = evaluate_against_rules(tool_name, sanitized_args, rules)
    if verdict == 'deny':
        log_and_block(tool_name, sanitized_args)
        return None
    elif verdict == 'amb':
        user_decision = request_user_approval(tool_name, sanitized_args)
        if user_decision != 'approve':
            return None
    # 3. Execute the tool and mask sensitive information in the result as well
    result = execute_tool(tool_name, sanitized_args)
    return sanitize_output(result)
Original Abstract
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.