Parallax: Why AI Agents That Think Must Never Act
TL;DR Highlight
Prompt guardrails are useless once the Agent is compromised. Parallax is a security architecture paradigm that completely separates inference and execution at the OS process level.
Who Should Read
Backend and platform developers building AI Agents that use real-world tools (file system access, shell execution, API calls) or designing secure agent architectures; AI infrastructure engineers evaluating defenses against prompt injection.
Core Mechanics
- The fundamental limitation of prompt guardrails: safety instructions and malicious inputs pass through the same LLM attention mechanism, so once the Agent is hacked, prompt-level protection is completely neutralized.
- Parallax is an architectural paradigm consisting of 4 principles — (1) Cognitive-Executive Separation (separation of inference/execution), (2) Adversarial Validation with Graduated Determinism (4-stage independent verification), (3) Information Flow Control (propagation of data sensitivity labels), (4) Reversible Execution (capture of state before destructive actions).
- The core principle of Cognitive-Executive Separation: LLM inference processes are sandboxed at the OS level with no file system access, network, or shell execution permissions, and can only propose actions via gRPC.
- Shield (the verification layer) operates in 4 tiers — Tier 0: YAML policy rules (deterministic), Tier 1: heuristic engine + DeBERTa classifier (parallel), Tier 2: separate LLM evaluation (budget limited), Tier 3: human approval. All stages are fail-closed (block on failure).
- Information Flow Control (IFC): If the Agent reads a credentials file, it is tagged with RESTRICTED, and Shield blocks the subsequent attempt to transmit that data over the network, regardless of how many intermediate stages there are — defense against multi-stage toolchain attacks.
- Dynamic Tool Surface Reduction: the Agent loads only the tool group needed for each turn and the rest remain invisible, minimizing the weapons available the moment it is compromised. Zero tools during a text-analysis task means zero tool attack surface.
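The Information Flow Control principle above can be sketched as a minimal taint tracker. This is an illustrative Python sketch under stated assumptions, not OpenParallax's actual (Go) implementation; the `Sensitivity` labels, class, and method names are invented for the example.

```python
from enum import IntEnum

class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2  # e.g., credentials, private keys

class TaintTracker:
    """Minimal sketch of Information Flow Control: a sensitivity label
    attaches when sensitive data enters the agent and propagates through
    the whole session, however many tool calls intervene."""
    def __init__(self):
        self.session_label = Sensitivity.PUBLIC

    def on_read(self, path: str) -> None:
        # Reading a credentials-like file raises the session's label.
        if any(marker in path for marker in (".ssh", ".aws", ".env", ".pem")):
            self.session_label = max(self.session_label, Sensitivity.RESTRICTED)

    def check_egress(self, action_type: str) -> str:
        # Once RESTRICTED data has flowed into the session, any
        # network-bound action is blocked, regardless of intermediate steps.
        if action_type == "http_request" and self.session_label >= Sensitivity.RESTRICTED:
            return "BLOCK"
        return "ALLOW"

tracker = TaintTracker()
tracker.on_read("/home/user/.ssh/id_rsa")    # session becomes RESTRICTED
print(tracker.check_egress("http_request"))  # BLOCK
```

This is why a multi-stage toolchain attack (read, transform, then exfiltrate) fails: the label travels with the session, not with any single tool call.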
Evidence
- 98.9% of attacks blocked with 0% false positives across 280 adversarial test cases (9 attack categories) under the default configuration; 100% blocked, at the cost of 36% false positives, under the maximum-security configuration.
- Assume-Compromise Evaluation (completely bypassing the inference system and directly injecting tool calls) reduced the success rate from 100% in Config A (no Shield) to 1.1% in Config C (Parallax).
- Shield tier processing distribution: Tier 0 (policy rules) 15.4%, Tier 1 (classifier) 58.2%, Tier 2 (LLM) 26.1%, Tier 3 (human) 0.4% — deterministic mechanisms (Tier 0+1) handle 73.6% of cases, minimizing LLM dependency.
- Claude Sonnet 4.6, Claude Haiku 4.5, and Llama 3.1 8B had a 0% attack success rate in LLM mode due to safety training, but Qwen 2.5 3B immediately executed harmful requests such as deleting files — demonstrating the realism of the 'hacked Agent' assumption.
How to Apply
- Separate the Agent server into two OS processes: strip the LLM inference process of every permission except its gRPC channel, and perform tool execution only in a separate Engine process. If you have an existing single-process Agent architecture, redesign from the process boundary outward, using the OpenParallax (Go) source code as a reference.
- Change the way tools are registered — do not expose all tools to the system prompt at once, but provide only one `load_tools` metatool and dynamically load only the group needed for the current task. For coding assistants, keep file R/W groups and shell execution groups separate and activate them only when necessary.
- Apply the Chronicle pattern before destructive actions (file deletion, DB writes, configuration changes) — save a SHA-256 based snapshot before action execution and make it recoverable with a `rollback` command. Adding only destructive action classification to an existing pipeline allows for partial application.
Code Example
```yaml
# Tier 0 policy file example (config.yaml)
deny:
  - name: block_sensitive_system_paths
    action_types: [read_file, write_file, delete_file]
    paths: ["~/.ssh/**", "~/.aws/**", "/etc/shadow", "**/*.pem", "**/.env"]
  - name: block_identity_deletion
    action_types: [delete_file]
    paths: ["**/SOUL.md", "**/IDENTITY.md"]
verify:
  - name: evaluate_shell_commands
    action_types: [execute_command]
    tier_override: 1  # escalate to Tier 1 (classifier)
  - name: evaluate_soul_modification
    action_types: [write_file]
    paths: ["**/SOUL.md", "**/IDENTITY.md"]
    tier_override: 2  # escalate to Tier 2 (LLM evaluation)
allow:
  - name: allow_workspace_reads
    action_types: [read_file, list_directory, search_files]
    paths: ["~/workspace/**"]
```
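Assuming Tier 0 evaluates deny rules first, then verify, then allow, and falls closed when nothing matches, the rule semantics above can be sketched in stdlib-only Python (`fnmatch` stands in for a real `**` glob engine; the function names and embedded policy subset are illustrative):

```python
import fnmatch

# A subset of the YAML policy, embedded as Python for a self-contained sketch.
POLICY = {
    "deny": [
        {"name": "block_sensitive_system_paths",
         "action_types": ["read_file", "write_file", "delete_file"],
         "paths": ["~/.ssh/**", "**/*.pem", "**/.env"]},
    ],
    "verify": [
        {"name": "evaluate_shell_commands",
         "action_types": ["execute_command"], "tier_override": 1},
    ],
    "allow": [
        {"name": "allow_workspace_reads",
         "action_types": ["read_file", "list_directory"],
         "paths": ["~/workspace/**"]},
    ],
}

def matches(rule, action_type, path):
    if action_type not in rule["action_types"]:
        return False
    # A rule with no 'paths' key matches any path.
    return any(fnmatch.fnmatch(path, pat) for pat in rule.get("paths", ["*"]))

def evaluate_tier0(action_type, path=""):
    # Deny rules win over everything else.
    for rule in POLICY["deny"]:
        if matches(rule, action_type, path):
            return "BLOCK"
    # Verify rules escalate to a higher tier.
    for rule in POLICY["verify"]:
        if matches(rule, action_type, path):
            return f"ESCALATE_TIER_{rule['tier_override']}"
    for rule in POLICY["allow"]:
        if matches(rule, action_type, path):
            return "ALLOW"
    # Fail-closed default: nothing passes silently.
    return "ESCALATE_TIER_1"

print(evaluate_tier0("delete_file", "~/.ssh/id_rsa"))  # BLOCK
print(evaluate_tier0("execute_command", "rm -rf /"))   # ESCALATE_TIER_1
```

Note that the deny-before-allow ordering is what makes the policy safe to extend: a broad `allow` rule can never override a narrower `deny`.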
```python
# Assume-Compromise test example (Python harness)
import grpc

# ShieldEvalRequest and the stub come from the generated Shield gRPC bindings.
def inject_tool_call(shield_stub, action_type, payload):
    """Directly inject tool calls into Shield, completely bypassing the LLM."""
    request = ShieldEvalRequest(
        action_type=action_type,
        payload=payload,
        # no agent context: simulate fully hacked state
    )
    response = shield_stub.Evaluate(request)
    return response.verdict  # ALLOW / BLOCK / ESCALATE

# Test: attempt to delete a credentials file
result = inject_tool_call(stub, "delete_file", {"path": "/home/user/.ssh/id_rsa"})
print(result)  # BLOCK (Tier 0 deny rule)

# Test: attempt to bypass with Base64 encoding
result = inject_tool_call(stub, "execute_command",
                          {"command": "echo 'cm0gLXJmIC8q' | base64 -d | sh"})
print(result)  # BLOCK (Tier 1 heuristic: base64-encoded command)
```
Original Abstract
Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.