Natural-Language Agent Harnesses
TL;DR Highlight
A framework that expresses and shares agent control logic (the harness) in natural language instead of code, executed by a shared runtime, enabling comparison, reuse, and analysis of design patterns.
Who Should Read
Backend and AI engineers building complex coding/automation agents with multi-agent frameworks like LangChain and AutoGen. Developers who want to systematically design and reuse step-by-step control logic for agents.
Core Mechanics
- Existing agent harnesses (control stacks) are buried inside controller code, making them hard to port, compare, or analyze — this paper proposes NLAH (Natural-Language Agent Harness), a format that externalizes them in natural language
- IHR (Intelligent Harness Runtime) places an LLM inside the runtime loop to directly interpret and execute natural-language harnesses, separating shared runtime policy from per-task harness logic
- On the OSWorld benchmark, migrating from a code harness (30.4%) to a natural-language NLAH (47.2%) actually improved performance — attributed to a strategy shift from GUI repair loops to file/shell-based verification
- In module combination experiments (RQ2), the self-evolution module was most effective on SWE-bench: 75.2% → 80.0% (+4.8%p). The assumption that 'more structure is always better' was wrong — verifier and multi-candidate search actually degraded performance
- The file-backed state module (externalizing state to files) improved OSWorld from 41.7% → 47.2% (+5.5%p). Durability — maintaining state even when context is truncated — is the key benefit
- When running the TRAE harness, approximately 90% of total tokens and tool calls originated from delegated child agents, not the parent thread — evidence that multi-agent delegation works in practice
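The IHR idea above, an LLM inside the runtime loop interpreting a natural-language harness, can be sketched as a minimal control loop. This is an illustrative assumption, not the paper's implementation: `call_llm` is a hypothetical stub standing in for any chat-completion API, here hard-wired to walk a fixed stage order so the sketch runs on its own.

```python
# Minimal sketch of an IHR-style loop: the runtime feeds the natural-language
# harness plus the current state to an LLM, which names the next stage to run.
# `call_llm` is a hypothetical stub; a real runtime would call a model here.

def call_llm(prompt: str) -> str:
    # Deterministic stand-in: walk PLAN -> EXECUTE -> VERIFY -> STOP based on
    # the last stage recorded in the prompt.
    order = ["PLAN", "EXECUTE", "VERIFY", "STOP"]
    for stage in order[:-1]:
        if f"last_stage: {stage}" in prompt:
            return order[order.index(stage) + 1]
    return "PLAN"

def run_harness(harness_text: str, max_steps: int = 10) -> list[str]:
    """Interpret the harness text step by step, returning the stage trace."""
    trace, last = [], "NONE"
    for _ in range(max_steps):
        prompt = f"Harness:\n{harness_text}\nlast_stage: {last}\nNext stage?"
        stage = call_llm(prompt).strip()
        if stage == "STOP":
            break
        trace.append(stage)
        last = stage
    return trace
```

The point of the sketch is the separation the paper describes: the loop (runtime policy) is fixed and shared, while the harness text passed in is the portable, editable artifact.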
Evidence
- OSWorld: code harness 30.4% vs. NLAH 47.2% (same task family, GPT-5.4 + IHR baseline)
- SWE-bench Verified: adding the self-evolution module 75.2% → 80.0% (+4.8%p); adding file-backed state 75.2% → 76.8% (+1.6%p)
- TRAE Full IHR: 91.5% of prompt tokens and 90.2% of tool calls originated from child agents (the parent thread used only ~9–10%)
- Full IHR vs. ablation across 125 SWE-bench samples: over 110 converged to identical results, so performance differences concentrated in a small number of edge cases
How to Apply
- Migrate existing agent pipelines (plan → execute → verify → repair) written in Python code into the natural-language NLAH format (YAML/Markdown), explicitly defining stages, roles, contracts, and a failure taxonomy so the runtime can interpret and execute them
- If long-running agents suffer from context loss, apply the file-backed state module pattern (managing task_state.json, manifest.json, and RESPONSE.md as path-addressable files) to externalize state to files
- In multi-agent systems, don't assume that adding a verifier or multi-candidate search always helps; modules that keep the retry loop tight, like self-evolution, can be more efficient, so add modules one at a time and run ablation tests
Code Example
# NLAH format example (natural-language harness specification)
Task: Implement a function and ensure it passes tests.
Roles:
- Planner: Analyze task and create a plan
- Solver: Responsible for code implementation
- Debugger: Responsible for fixes upon failure
Stages:
1. PLAN
   Role: Planner
   Output: task_plan.md
   Contract: Include implementation strategy and list of edge cases
2. EXECUTE
   Role: Solver
   Output: solution.py
   Contract: Valid Python code, no syntax errors
3. VERIFY
   Action: run_tests(solution.py)
   If passed → STOP
   If failed → REPAIR
4. REPAIR
   Role: Debugger
   Input: failing_code + error_message
   Action: fix and overwrite solution.py
   Retry: VERIFY (max 3 attempts)
File-backed State:
- state/task_history.jsonl # append-only execution log
- artifacts/manifest.json # artifact index
- children/*/RESPONSE.md # child agent responses
Failure Taxonomy:
- format_error → regenerate code
- test_failure → go to REPAIR
- tool_error → retry once
- timeout → report incomplete
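A runtime interpreting this harness would drive the VERIFY → REPAIR loop and dispatch on the failure taxonomy roughly as follows. This is a sketch, not the paper's runtime: `run_tests` and `repair` are hypothetical stand-ins for the real tool calls, passed in as functions so the loop stays testable.

```python
MAX_ATTEMPTS = 3  # matches "Retry: VERIFY (max 3 attempts)" above

def drive(run_tests, repair, solution: str) -> str:
    """Run VERIFY/REPAIR rounds, returning a terminal status.

    `run_tests` returns one of the taxonomy labels ("passed", "format_error",
    "test_failure", "tool_error", "timeout"); `repair` plays the Debugger role.
    """
    for _attempt in range(MAX_ATTEMPTS):
        outcome = run_tests(solution)
        if outcome == "passed":
            return "STOP"
        if outcome in ("test_failure", "format_error"):
            solution = repair(solution)   # Debugger fixes / regenerates code
        elif outcome == "tool_error":
            continue                       # taxonomy: retry once
        else:                              # timeout or unknown
            return "report_incomplete"
    return "report_incomplete"
```

The taxonomy becomes an explicit contract between harness and runtime: each label maps to exactly one control-flow decision, which is what makes the natural-language spec executable.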
Original Abstract
Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.