HiL-Bench (Human-in-the-Loop Benchmark): Do Agents Know When to Ask for Help?
TL;DR Highlight
A benchmark for measuring an AI coding agent's ability to decide when to ask humans for help when given incomplete specifications.
Who Should Read
Engineering teams looking to deploy coding agents (Claude Code, Codex, Cursor, etc.) in production. ML/AI engineers who want to understand why AI agents fail with ambiguous requirements.
Core Mechanics
- Current benchmarks like SWE-bench and HumanEval provide complete specifications, so they cannot distinguish an agent that guesses luckily from one that correctly asks. HiL-Bench directly addresses this shortcoming.
- Models achieving 75-89% pass@3 with complete information plummet to 4-24% when they must decide when to ask questions themselves. This is a problem of judgment, not capability.
- Introduction of a new metric, Ask-F1: the harmonic mean of question precision (avoiding useless questions) and blocker recall (finding all necessary information gaps). Its structure prevents gaming the metric by flooding the human with questions.
- Each model family exhibits unique failure patterns: GPT-based models confidently execute with false beliefs, Claude detects uncertainty but doesn't act, and Gemini responds to external signals but has large cross-domain variance.
- Blockers fall into three types: information missing from the specification (42%), ambiguous requests with multiple interpretations (36%), and contradictory information (22%). These are typical patterns of real-world production failures.
- Training Qwen3-32B with RLVR (Reinforcement Learning with Verifiable Rewards) demonstrates that the ability to judge when to ask for help is trainable. A model trained on SQL shows performance improvements on SWE—learning general uncertainty detection rather than domain-specific rules.
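The three blocker types above can be sketched as a small data model. This is an illustrative sketch only; the class names, IDs, and the example task are hypothetical, not taken from the HiL-Bench release.

```python
from dataclasses import dataclass
from enum import Enum

class BlockerType(Enum):
    MISSING_INFO = "missing information"         # 42% of blockers
    AMBIGUOUS = "ambiguous request"              # 36%
    CONTRADICTORY = "contradictory information"  # 22%

@dataclass(frozen=True)
class Blocker:
    blocker_id: str
    blocker_type: BlockerType
    description: str  # what the agent should notice during exploration
    resolution: str   # the answer ask_human() would return

# Hypothetical blocker for a text-to-SQL task
example = Blocker(
    blocker_id="sql-017-b1",
    blocker_type=BlockerType.AMBIGUOUS,
    description="'recent orders' could mean the last 7 or the last 30 days",
    resolution="Interpret 'recent' as the last 30 days",
)
print(example.blocker_type.value)  # ambiguous request
```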
Evidence
- GPT-5.3-Codex achieves 87% pass@3 on SQL with complete information, but drops to 5% when self-judging with the ask_human() tool. Ask-F1 is only 18.8%.
- Claude Opus 4.6 achieves the highest Ask-F1 (SQL 62.0%) among the models tested, but shows the largest cross-domain gap on SWE (28.2%). Still a -53pp difference compared to the complete information condition.
- RLVR-trained Qwen3-32B: SQL Ask-F1 improves from 18% to 46% (+28pp), and pass@3 improves from 4% to 21% (+17pp). Cross-domain transfer confirmed with SWE Ask-F1 also improving by +13pp, despite training only on SQL.
- The ask_human() judge model (Llama-3.3-70B-Instruct) achieves 97% precision and 91% recall, in high agreement with human judgment.
How to Apply
- When attaching a question tool like ask_human() to production agents, simply adding the tool is not enough. The model's ability to judge when to ask questions must be evaluated separately using Ask-F1. Current GPT/Claude/Gemini models all lack this judgment ability.
- When fine-tuning agents with RL, combining an asymmetric per-step reward (+0.3 for relevant blocker questions, -0.1 for irrelevant questions) with a terminal reward for overall blocker coverage can simultaneously improve question precision and recall.
- When building your own agent benchmarks, remove some information from the specification and embed blockers that are revealed only during exploration (progressive discovery). If every ambiguity is visible upfront, the condition is not realistic.
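A minimal sketch of the progressive-discovery idea, assuming a hypothetical task where the blocker only becomes visible once the agent inspects the schema (all names and the spec text are illustrative):

```python
# Spec deliberately omits what 'active' means.
SPEC = "Write a query counting active users per region."

# Schema is only visible after the agent explores the database.
SCHEMA = {"users": ["id", "region", "last_login", "deleted_at"]}

def visible_blockers(explored_tables: set[str]) -> list[str]:
    """Blockers revealed by what the agent has explored so far."""
    blockers = []
    if "users" in explored_tables:
        # The deleted_at column raises a question the spec never answers.
        blockers.append("Should soft-deleted users count as active?")
    return blockers

print(visible_blockers(set()))      # [] -- nothing looks ambiguous upfront
print(visible_blockers({"users"}))  # the blocker surfaces only after exploring
```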
Code Example
# HiL-Bench ask_human() tool and Ask-F1 calculation example
def ask_human(question: str) -> str:
    """
    Tool the agent calls when it encounters uncertain information.
    Llama-3.3-70B-Instruct performs a semantic judgment: it returns
    a resolution if the question matches a registered blocker,
    and 'irrelevant question' otherwise.
    """
    # An actual implementation needs a blocker registry and semantic matching.
    pass
def compute_ask_f1(questions: list[str], blockers: list[str], ask_human_fn) -> dict:
    """
    Ask-F1 calculation: harmonic mean of question precision and blocker recall.

    Args:
        questions: list of questions asked by the agent
        blockers: list of blockers registered for the task
        ask_human_fn: the ask_human tool function
    """
    relevant_questions = set()
    addressed_blockers = set()
    for q in questions:
        response = ask_human_fn(q)
        if response != 'irrelevant question':
            relevant_questions.add(q)
            # Track which blocker was resolved
            # (an actual implementation needs a blocker ID mapping)
            addressed_blockers.add(response)
    precision = len(relevant_questions) / len(questions) if questions else 0
    recall = len(addressed_blockers) / len(blockers) if blockers else 0
    ask_f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0)
    return {
        'precision': precision,
        'recall': recall,
        'ask_f1': ask_f1,
        'relevant_questions': len(relevant_questions),
        'addressed_blockers': len(addressed_blockers),
    }
# RLVR reward function design
def compute_reward(question: str, is_relevant: bool, is_duplicate: bool,
                   blockers_discovered: int, total_blockers: int,
                   is_terminal: bool) -> float:
    """Shaped reward from the HiL-Bench paper."""
    reward = 0.0
    # Per-step reward: encourage question precision
    if is_relevant and not is_duplicate:
        reward += 0.3  # hit a relevant blocker
    else:
        reward -= 0.1  # penalty for irrelevant or duplicate questions
    # Terminal reward: encourage blocker recall
    if is_terminal and blockers_discovered >= 1:
        reward += blockers_discovered / total_blockers
    return reward
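As a quick sanity check of the shaped reward, consider a hypothetical trajectory with three questions against a task containing two blockers: two questions hit relevant blockers, one is irrelevant, and both blockers end up covered.

```python
# Step rewards: relevant (+0.3), irrelevant (-0.1), relevant (+0.3),
# then the terminal bonus for covering 2 of 2 blockers (+1.0).
step_rewards = [0.3, -0.1, 0.3]
terminal_bonus = 2 / 2
total = round(sum(step_rewards) + terminal_bonus, 2)
print(total)  # 1.5
```

The asymmetry (+0.3 vs -0.1) means a question is worth asking even under moderate uncertainty, while the terminal bonus keeps the agent from stopping after its first hit.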
Original Abstract
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.