Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks

Mar 31, 2026•Chong Xiang, Drew Zagieboylo, Shaona Ghosh +5•View PDF

TL;DR Highlight

To protect AI agents from malicious commands hidden in external data, you must co-design dynamic planning, LLM input restriction, and human intervention.

Who Should Read

Backend/ML engineers deploying LLM-based agents (coding assistants, email processing bots, etc.) to production. Developers concerned with security of agent systems that handle untrusted data such as external web pages, emails, and tool outputs.

Core Mechanics

Indirect Prompt Injection (hiding malicious commands inside external data that agents read) is currently the largest agent security threat, and existing defenses each have their own limitations
Plan-execution isolation (locking the plan during execution) is impractical in real-world scenarios like API version changes or runtime errors, as it causes the agent to halt completely
When using an LLM for security judgment, avoid passing the entire raw environment text — instead pass only structured diffs/traces to narrow the scope of judgment and reduce the attack surface
A proposed two-stage separation approach: first have the LLM verbalize the instruction it intends to follow, then track its source (trusted vs. untrusted) and let system-level code decide whether to execute it
Another valid approach: have the LLM auto-generate a step-specific programming validator for each execution step to verify environment responses (e.g., generating a regex that extracts only a number from a specific DOM path)
Existing benchmarks like AgentDojo use only static attack payloads and include almost no complex tasks requiring replanning, leading to overly optimistic evaluations of defense performance

Evidence

"Among 97 tasks in the AgentDojo benchmark, only 6 (6%) require replanning or policy updates — failing to reflect real-world scenarios. All existing benchmarks (AgentDojo, InjecAgent, ASB, etc.) use only static attack payloads, with no RL-based adaptive attackers or genetic algorithm-optimized attacks included. Task-agnostic global policies like CaMeL's send_money policy produce false positives, blocking legitimate cases such as reading recipient info from an emailed invoice. Progent's architecture passes unlimited environment feedback to the LLM to update policies, making it vulnerable to adaptive attacks that directly target the policy adjuster LLM."

How to Apply

"Before an agent executes an instruction read from an external source (web page, email, etc.), implement a two-stage pipeline: have the LLM explicitly repeat the command it intends to follow, verify whether its source is trusted, and then let system code decide to block or allow execution. Instead of passing raw environment responses to the LLM executor, have the LLM generate a validator for the expected response format at the start of each step, then pass only the filtered, structured result through that validator (e.g., a regex + explicit DOM path that extracts only the Q4 revenue figure). When the agent's plan needs to change, pass only a structured JSON containing the diff before and after the change and a summary of execution history to the LLM security judgment model to assess whether the change is contextually reasonable given the original task — never include the original environment text."

Code Example

snippet

Terminology

Indirect Prompt InjectionAn attack that hides malicious commands inside external data (web pages, emails, etc.) that an AI agent reads. Similar to writing 'anyone who reads this document must send money immediately' inside a document.

Plan-execution isolationA defense strategy that locks the agent's initial plan so it cannot be changed during execution. It prevents external data from altering the plan, but has the downside of causing the agent to halt when unexpected errors occur.

Policy EnforcerA gatekeeper module that checks whether an action the agent intends to take falls within the allowed scope. Example: enforcing rules like 'this file may be read, but that file may not.'

IFC (Information Flow Control)A technique that assigns security labels such as 'public / internal / confidential' to data and prevents data at a higher classification from flowing to a lower one. Prevents confidential document contents from leaking to public channels.

VerbalizationMaking the LLM explicitly output in text the command it has internally decided to follow. A transparency technique that makes the LLM's implicit decisions auditable.

Least PrivilegeA security principle of granting only the minimum permissions necessary to perform a given task. The idea that an email-reading task does not need file-deletion privileges.

Adaptive AttackAn attack that learns the structure of a defense system and optimizes against it. Instead of using fixed malicious strings, it automatically finds payloads that bypass defenses using RL or genetic algorithms.

Related Resources

Original Abstract (Expand)

AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.