Natural-Language Agent Harnesses
TL;DR Highlight
A framework that expresses and shares agent control logic (the harness) in natural language instead of code, executed by a shared runtime, enabling comparison, reuse, and analysis of design patterns.
Who Should Read
Backend and AI engineers building complex coding/automation agents with multi-agent frameworks like LangChain and AutoGen. Developers who want to systematically design and reuse step-by-step control logic for agents.
Core Mechanics
- Existing agent harnesses (control stacks) are buried inside controller code, making them hard to port, compare, or analyze — this paper proposes NLAH (Natural-Language Agent Harness), a format that externalizes them in natural language
- IHR (Intelligent Harness Runtime) places an LLM inside the runtime loop to directly interpret and execute natural-language harnesses, separating shared runtime policy from per-task harness logic
- On the OSWorld benchmark, migrating from a code harness (30.4%) to a natural-language NLAH (47.2%) actually improved performance — attributed to a strategy shift from GUI repair loops to file/shell-based verification
- In module combination experiments (RQ2), the self-evolution module was most effective on SWE-bench: 75.2% → 80.0% (+4.8%p). The assumption that 'more structure is always better' was wrong — verifier and multi-candidate search actually degraded performance
- The file-backed state module (externalizing state to files) improved OSWorld from 41.7% → 47.2% (+5.5%p). Durability — maintaining state even when context is truncated — is the key benefit
- When running the TRAE harness, approximately 90% of total tokens and tool calls originated from delegated child agents, not the parent thread — evidence that multi-agent delegation works in practice
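The IHR idea above, an LLM inside the runtime loop interpreting a natural-language harness, can be sketched as a minimal control loop. This is an illustrative assumption, not the paper's implementation: `call_llm` is a hypothetical stub standing in for any chat-completion API, here hard-wired to walk a fixed stage order so the sketch runs on its own.

```python
# Minimal sketch of an IHR-style loop: the runtime feeds the natural-language
# harness plus the current state to an LLM, which names the next stage to run.
# `call_llm` is a hypothetical stub; a real runtime would call a model here.

def call_llm(prompt: str) -> str:
    # Deterministic stand-in: walk PLAN -> EXECUTE -> VERIFY -> STOP based on
    # the last stage recorded in the prompt.
    order = ["PLAN", "EXECUTE", "VERIFY", "STOP"]
    for stage in order[:-1]:
        if f"last_stage: {stage}" in prompt:
            return order[order.index(stage) + 1]
    return "PLAN"

def run_harness(harness_text: str, max_steps: int = 10) -> list[str]:
    """Interpret the harness text step by step, returning the stage trace."""
    trace, last = [], "NONE"
    for _ in range(max_steps):
        prompt = f"Harness:\n{harness_text}\nlast_stage: {last}\nNext stage?"
        stage = call_llm(prompt).strip()
        if stage == "STOP":
            break
        trace.append(stage)
        last = stage
    return trace
```

The point of the sketch is the separation the paper describes: the loop (runtime policy) is fixed and shared, while the harness text passed in is the portable, editable artifact.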
Evidence
- OSWorld: code harness 30.4% vs. NLAH 47.2% (same task family, GPT-5.4 + IHR baseline)
- SWE-bench Verified: adding the self-evolution module 75.2% → 80.0% (+4.8%p); adding file-backed state 75.2% → 76.8% (+1.6%p)
- TRAE Full IHR: 91.5% of prompt tokens and 90.2% of tool calls originated from child agents (the parent thread used only ~9–10%)
- Full IHR vs. ablation across 125 SWE-bench samples: over 110 converged to identical results, so performance differences concentrated in a small number of edge cases
How to Apply
- Migrate existing agent pipelines (plan → execute → verify → repair) written in Python code into the natural-language NLAH format (YAML/Markdown), explicitly defining stages, roles, contracts, and a failure taxonomy so the runtime can interpret and execute them
- If long-running agents suffer from context loss, apply the file-backed state module pattern (managing task_state.json, manifest.json, and RESPONSE.md as path-addressable files) to externalize state to files
- In multi-agent systems, don't assume that adding a verifier or multi-candidate search always helps; modules that keep the retry loop tight, like self-evolution, can be more efficient, so add modules one at a time and run ablation tests
Code Example
# NLAH format example (natural-language harness specification)
Task: Implement a function and ensure it passes tests.
Roles:
- Planner: Analyze task and create a plan
- Solver: Responsible for code implementation
- Debugger: Responsible for fixes upon failure
Stages:
1. PLAN
   Role: Planner
   Output: task_plan.md
   Contract: Include implementation strategy and list of edge cases
2. EXECUTE
   Role: Solver
   Output: solution.py
   Contract: Valid Python code, no syntax errors
3. VERIFY
   Action: run_tests(solution.py)
   If passed → STOP
   If failed → REPAIR
4. REPAIR
   Role: Debugger
   Input: failing_code + error_message
   Action: fix and overwrite solution.py
   Retry: VERIFY (max 3 attempts)
File-backed State:
- state/task_history.jsonl # append-only execution log
- artifacts/manifest.json # artifact index
- children/*/RESPONSE.md # child agent responses
Failure Taxonomy:
- format_error → regenerate code
- test_failure → go to REPAIR
- tool_error → retry once
- timeout → report incomplete
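A runtime interpreting this harness would drive the VERIFY → REPAIR loop and dispatch on the failure taxonomy roughly as follows. This is a sketch, not the paper's runtime: `run_tests` and `repair` are hypothetical stand-ins for the real tool calls, passed in as functions so the loop stays testable.

```python
MAX_ATTEMPTS = 3  # matches "Retry: VERIFY (max 3 attempts)" above

def drive(run_tests, repair, solution: str) -> str:
    """Run VERIFY/REPAIR rounds, returning a terminal status.

    `run_tests` returns one of the taxonomy labels ("passed", "format_error",
    "test_failure", "tool_error", "timeout"); `repair` plays the Debugger role.
    """
    for _attempt in range(MAX_ATTEMPTS):
        outcome = run_tests(solution)
        if outcome == "passed":
            return "STOP"
        if outcome in ("test_failure", "format_error"):
            solution = repair(solution)   # Debugger fixes / regenerates code
        elif outcome == "tool_error":
            continue                       # taxonomy: retry once
        else:                              # timeout or unknown
            return "report_incomplete"
    return "report_incomplete"
```

The taxonomy becomes an explicit contract between harness and runtime: each label maps to exactly one control-flow decision, which is what makes the natural-language spec executable.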
Original Abstract
Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.