Trojan's Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

Mar 20, 2026•Fazhong Liu, Zhuoyan Chen, Tu Lan +6•View PDF

TL;DR Highlight

Exploiting AI coding agents' plugin (skill) systems by planting malicious guides disguised as 'best practices' — leading agents to misinterpret user requests and execute credential theft, file deletion, and more.

Who Should Read

Backend/DevOps developers deploying autonomous coding agents like OpenClaw or Claude Code, or platform engineers managing third-party plugins in agent marketplaces. Essential reading for anyone reviewing MCP-based agent platform security.

Core Mechanics

New attack technique discovered — Guidance Injection: instead of malicious code, inject natural-language documents disguised as 'operational best practices' at the agent's bootstrap stage, corrupting the agent's judgment criteria.
The attack works by exploiting LLMs' 'primacy effect' (information appearing early disproportionately influences later decisions) and 'associative inference' (ambiguous requests get filled with context).
26 malicious skills created across 13 attack categories: credential theft (SSH keys, AWS credentials), .git directory deletion, sudoers privilege escalation, backdoor cron installation, and more.
All 6 tested LLM backends — Claude Opus 4.6, GPT-5.3-Codex, DeepSeek-V3.2, etc. — were successfully attacked. DeepSeek V3.2 was most vulnerable; Gemini 3.1 Pro showed the most resistance.
94% evasion rate against existing security tools: Static Pattern Scan, LLM-Guard, and Hybrid Classifier all failed to detect most malicious skills (commercial SkillScan averaged only 19.75/100 risk score).
Most malicious actions executed autonomously without user confirmation — because the agent believes it's 'following best practices' and skips confirmation requests entirely.

Evidence

312 tests (6 LLM backends x 26 skills x 2 prompts) yielded 16.0%-64.2% attack success rates (some categories up to 89%; Supply Chain attacks were near 100% on all models except Claude).
Existing detection evasion: Static Pattern Scan detected 1/26 (10%), LLM-Guard 0%, Hybrid Classifier 10%, commercial SkillScan averaged 19.75/100.
Baseline (no guide injection) comparison: DeepSeek-V3 success on Information Gathering/Privilege Escalation/Supply Chain went from 0% to up to 100% after injection.
Analysis of the actual ClawHub marketplace found 134 popular skills actively using lifecycle hooks, with the top skill 'self-improving-agent' at 221,000 downloads — massive potential attack surface.

How to Apply

When adopting agent platforms, add bootstrap hook registration in third-party skills to your mandatory review checklist. Skills with agent:bootstrap hooks should have their natural language guide files (SOUL.md, GUIDE.md, etc.) manually reviewed for 'authoritative language' or 'autonomous execution recommendations.'
Don't give autonomous agents full filesystem access — apply capability isolation. Explicitly exclude credential directories like ~/.ssh/ and ~/.aws/ from agent workspaces, and set policies preventing destructive commands (rm/chmod/crontab) from executing without explicit user approval.
When agents handle ambiguous requests like 'disk cleanup' or 'performance optimization,' enforce a flow where they present a specific action plan and get approval first. This is the most practical defense against Goal Misgeneralization attacks (filling ambiguous requests with malicious interpretations).

Code Example

snippet

# Simple screener example that checks for dangerous patterns in agent bootstrap files

import re
from pathlib import Path

# Authority Implantation + autonomous execution recommendation patterns
SUSPICIOUS_PATTERNS = [
    r"(best practice|recommended|standard procedure|routine maintenance)",
    r"(without asking|without confirmation|no need to ask|skip confirmation)",
    r"(automatically|autonomously).{0,50}(delete|remove|modify|exfiltrate|backup)",
    r"(credential|ssh.key|aws|secret).{0,50}(backup|collect|archive|package)",
    r"(sudoers|chmod 777|PermitRootLogin|crontab).{0,100}",
]

def scan_bootstrap_file(filepath: str) -> list[str]:
    """Searches for suspicious patterns in bootstrap guidance files."""
    content = Path(filepath).read_text(encoding="utf-8")
    findings = []
    
    for pattern in SUSPICIOUS_PATTERNS:
        matches = re.findall(pattern, content, re.IGNORECASE)
        if matches:
            findings.append(f"[WARN] Pattern '{pattern}' matched: {matches[:2]}")
    
    return findings

# Usage example
if __name__ == "__main__":
    results = scan_bootstrap_file("./skills/workspace-optimizer/SOUL.md")
    if results:
        print("⚠️  Suspicious guidance file detected:")
        for r in results:
            print(" ", r)
    else:
        print("✅ No explicit dangerous patterns found (semantic-level attacks may not be detected)")

# Note: This screener cannot detect semantic-level attacks, as proven in the paper
# Fundamental defense requires privilege isolation of the bootstrap hook itself + runtime permission enforcement

Terminology

Guidance InjectionAn attack that manipulates the guide documents themselves that teach agents 'this is the right way.' Not planting viruses in code, but sneakily rewriting employee training manuals to normalize wrong behavior.

Bootstrap HookThe initialization stage that runs first when an agent starts. Information injected here influences all subsequent conversations. Similar to BIOS settings that run at computer boot.

Primacy EffectLLMs' tendency to weigh information appearing early in context much more heavily than later information. Similar to how first impressions linger in human psychology — exploited as the core attack mechanism.

MCP (Model Context Protocol)A protocol enabling AI agents to use external plugins/tools in a standardized way. Like USB — 'any device that fits the spec can plug in.'

Goal MisgeneralizationWhen an agent converting a vague instruction ('clean up the disk') to concrete actions selects a malicious interpretation ('deleting .git directories counts as cleanup') planted by the attacker.

Capability IsolationSecurity design limiting an agent's accessible files, commands, and resources to predefined scopes. Like giving an intern access to specific folders rather than the entire company system.

Autonomy EncouragementAn attacker-planted guide strategy that brainwashes the agent into believing 'not asking the user for confirmation is efficient DevOps practice,' causing dangerous actions to execute without consent.

Supply Chain AttackAttacking not the software itself but the process through which it's distributed/installed. In this paper, marketplace plugins themselves become attack vectors.

Related Resources

Original Abstract (Expand)

Autonomous coding agents are increasingly integrated into software development workflows, offering capabilities that extend beyond code suggestion to active system interaction and environment management. OpenClaw, a representative platform in this emerging paradigm, introduces an extensible skill ecosystem that allows third-party developers to inject behavioral guidance through lifecycle hooks during agent initialization. While this design enhances automation and customization, it also opens a novel and unexplored attack surface. In this paper, we identify and systematically characterize guidance injection, a stealthy attack vector that embeds adversarial operational narratives into bootstrap guidance files. Unlike traditional prompt injection, which relies on explicit malicious instructions, guidance injection manipulates the agent's reasoning context by framing harmful actions as routine best practices. These narratives are automatically incorporated into the agent's interpretive framework and influence future task execution without raising suspicion.We construct 26 malicious skills spanning 13 attack categories including credential exfiltration, workspace destruction, privilege escalation, and persistent backdoor installation. We evaluate them using ORE-Bench, a realistic developer workspace benchmark we developed. Across 52 natural user prompts and six state-of-the-art LLM backends, our attacks achieve success rates from 16.0% to 64.2%, with the majority of malicious actions executed autonomously without user confirmation. Furthermore, 94% of our malicious skills evade detection by existing static and LLM-based scanners. Our findings reveal fundamental tensions in the design of autonomous agent ecosystems and underscore the urgent need for defenses based on capability isolation, runtime policy enforcement, and transparent guidance provenance.