AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
TL;DR Highlight
AI Defenses systematically designs security layers across the AI lifecycle to mitigate risks.
Who Should Read
Backend/infrastructure developers deploying LLM-powered autonomous agents to production. AI system designers actively considering agent security threats like Prompt Injection, memory corruption, and malicious plugins.
Core Mechanics
- Agent security threats propagate sequentially—initialization → input → memory → decision-making → execution—and aren’t solved by a single point of defense like input filtering.
- Five protection layers comprise the architecture: Foundation Scan (supply chain), Input Sanitization, Cognition Protection (memory), Decision Alignment, and Execution Control. Each layer operates on a different security principle to prevent common bypass patterns.
- A zero-trust principle is applied—even if an upstream layer ‘allows’ access, downstream layers independently re-verify. The architecture assumes upstream components are already compromised.
- Cross-layer coordination transmits ‘ambiguous’ signals from one layer to the next for cumulative risk assessment. Weak signals accumulate to automatically trigger stricter execution policies.
- A malicious skill scenario: Foundation Scan detects a mismatch between skill description and code → Decision Alignment detects an unauthorized plan → Execution Control blocks file access. This illustrates the interplay of three layers.
- An Indirect Prompt Injection → memory backdoor scenario: Cognition Protection blocks a malicious command injected via a webpage from being stored in MEMORY.md, preventing the memory from becoming a relay point for future attacks.
Evidence
- "The architecture was demonstrated by implementing a plugin-native prototype on top of the OpenClaw agent, successfully blocking attacks in two multi-stage attack chains (malicious skill → data exfiltration, Indirect Prompt Injection → persistent backdoor + DoS) through inter-layer cooperation."
How to Apply
- Classify runtime events in your agent system into five stages (initialization/input/memory/decision/execution) and add independent validation hooks to each stage. Start by inserting a command pattern check layer immediately before tool calls.
- If your agent stores external documents or web search results in memory (files/DB), add a Cognition Protection layer before storage to inspect for Prompt Injection patterns and content anomalies, preventing persistent backdoors.
- Maintain security assessment results in a shared security state and pass them to subsequent layers. Implement a cumulative escalation pattern where a ‘suspicious but not blockable’ assessment in one layer triggers stricter policies for high-risk actions.
Code Example
# OpenClaw plugin style - AgentWard layer hook attachment example
class AgentWardPlugin:
def __init__(self):
self.session_risk_state = {"risk_score": 0, "warnings": []}
# Foundation Scan: Check before skill loading
def before_prompt_build(self, context):
for skill in context.loaded_skills:
if self._detect_skill_mismatch(skill):
self.session_risk_state["warnings"].append({
"layer": "foundation_scan",
"skill": skill.name,
"finding": "description_code_mismatch"
})
self.session_risk_state["risk_score"] += 30
# Input Sanitization: Check when external content is input
def before_message_write(self, message):
if message.role == "tool":
if self._detect_prompt_injection(message.content):
message.content = self._sanitize(message.content)
self.session_risk_state["risk_score"] += 20
self.session_risk_state["warnings"].append({
"layer": "input_sanitization",
"action": "sanitized"
})
# Cognition Protection: Check when modifying memory files
# Execution Control: Monitor all tool calls
def before_tool_call(self, tool_name, params, is_memory_write=False):
if is_memory_write:
# Cognition Protection
if self._detect_malicious_memory_pattern(params):
return {"block": True, "reason": "suspicious_memory_mutation"}
# Execution Control: Strengthen policy based on cumulative risk
if self.session_risk_state["risk_score"] > 40:
if self._is_high_risk_command(tool_name, params):
return {"block": True, "reason": "high_risk_under_elevated_session_risk"}
return {"block": False}
def _detect_skill_mismatch(self, skill): ...
def _detect_prompt_injection(self, content): ...
def _sanitize(self, content): ...
def _detect_malicious_memory_pattern(self, params): ...
def _is_high_risk_command(self, tool_name, params): ...Terminology
Related Papers
Ramp's Sheets AI Exfiltrates Financials
Ramp's spreadsheet AI agent succumbed to a hidden prompt injection within an external dataset, automatically inserting malicious formulas and exfiltrating confidential financial data to an external server.
Letting AI play my game – building an agentic test harness to help play-testing
IndieGameAgent automatically playtests games using an LLM, solving a QA bottleneck for solo developers.
Tendril – a self-extending agent that builds and registers its own tools
Tendril demonstrates a self-extending AI agent pattern by dynamically writing and registering tools when needed, creating a growing repository of capabilities with each session.
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Dirac cuts API costs 64.8% and achieves 65.2% on TerminalBench-2 with efficient context management.
EvanFlow – A TDD driven feedback loop for Claude Code
EvanFlow automates code brainstorming, TDD, and validation in Claude Code with 16 skills triggered by a single prompt.
An AI agent deleted our production database. The agent's confession is below
Cursor AI Agent가 Railway 프로덕션 데이터베이스와 백업까지 통째로 삭제한 사고 사례로, AI Agent에 과도한 권한을 줄 때의 위험성과 엔지니어링 통제의 중요성을 보여준다.
Related Resources
Original Abstract (Expand)
Autonomous AI agents extend large language models into full runtime systems that load skills, ingest external content, maintain memory, plan multi-step actions, and invoke privileged tools. In such systems, security failures rarely remain confined to a single interface; instead, they can propagate across initialization, input processing, memory, decision-making, and execution, often becoming apparent only when harmful effects materialize in the environment. This paper presents AgentWard, a lifecycle-oriented, defense-in-depth architecture that systematically organizes protection across these five stages. AgentWard integrates stage-specific, heterogeneous controls with cross-layer coordination, enabling threats to be intercepted along their propagation paths while safeguarding critical assets. We detail the design rationale and architecture of five coordinated protection layers, and implement a plugin-native prototype on OpenClaw to demonstrate practical feasibility. This perspective provides a concrete blueprint for structuring runtime security controls, managing trust propagation, and enforcing execution containment in autonomous AI agents. Our code is available at https://github.com/FIND-Lab/AgentWard .