Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks
TL;DR Highlight
To protect AI agents from malicious commands hidden in external data, you must co-design dynamic planning, LLM input restriction, and human intervention.
Who Should Read
Backend/ML engineers deploying LLM-based agents (coding assistants, email processing bots, etc.) to production. Developers concerned with security of agent systems that handle untrusted data such as external web pages, emails, and tool outputs.
Core Mechanics
- Indirect Prompt Injection (hiding malicious commands inside external data that agents read) is currently the largest agent security threat, and existing defenses each have their own limitations
- Plan-execution isolation (locking the plan during execution) is impractical in real-world scenarios like API version changes or runtime errors, as it causes the agent to halt completely
- When using an LLM for security judgment, avoid passing the entire raw environment text — instead pass only structured diffs/traces to narrow the scope of judgment and reduce the attack surface
- A proposed two-stage separation approach: first have the LLM verbalize the instruction it intends to follow, then track its source (trusted vs. untrusted) and let system-level code decide whether to execute it
- Another valid approach: have the LLM auto-generate a step-specific programming validator for each execution step to verify environment responses (e.g., generating a regex that extracts only a number from a specific DOM path)
- Existing benchmarks like AgentDojo use only static attack payloads and include almost no complex tasks requiring replanning, leading to overly optimistic evaluations of defense performance
Evidence
- "Among 97 tasks in the AgentDojo benchmark, only 6 (6%) require replanning or policy updates — failing to reflect real-world scenarios. All existing benchmarks (AgentDojo, InjecAgent, ASB, etc.) use only static attack payloads, with no RL-based adaptive attackers or genetic algorithm-optimized attacks included. Task-agnostic global policies like CaMeL's send_money policy produce false positives, blocking legitimate cases such as reading recipient info from an emailed invoice. Progent's architecture passes unlimited environment feedback to the LLM to update policies, making it vulnerable to adaptive attacks that directly target the policy adjuster LLM."
How to Apply
- "Before an agent executes an instruction read from an external source (web page, email, etc.), implement a two-stage pipeline: have the LLM explicitly repeat the command it intends to follow, verify whether its source is trusted, and then let system code decide to block or allow execution. Instead of passing raw environment responses to the LLM executor, have the LLM generate a validator for the expected response format at the start of each step, then pass only the filtered, structured result through that validator (e.g., a regex + explicit DOM path that extracts only the Q4 revenue figure). When the agent's plan needs to change, pass only a structured JSON containing the diff before and after the change and a summary of execution history to the LLM security judgment model to assess whether the change is contextually reasonable given the original task — never include the original environment text."
Code Example
Terminology
Related Resources
- CaMeL: Defeating prompt injections by design
- AgentDojo: Dynamic environment to evaluate prompt injection
- Progent: Programmable privilege control for LLM agents
- AgentDyn: Dynamic open-ended benchmark for agent security
- Instruction-following intent analysis (author's prior work)
- DRIFT: Dynamic rule-based defense with injection isolation
Original Abstract (Expand)
AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.