HiL-Bench (Human-in-the-Loop Benchmark): Do Agents Know When to Ask for Help?
TL;DR Highlight
A benchmark for measuring an AI coding agent's ability to decide when to ask humans for help when given incomplete specifications.
Who Should Read
Engineering teams looking to deploy coding agents (Claude Code, Codex, Cursor, etc.) in production. ML/AI engineers who want to understand why AI agents fail with ambiguous requirements.
Core Mechanics
- Current benchmarks like SWE-bench and HumanEval provide complete specifications, so they cannot distinguish an agent that guesses luckily from one that correctly asks. HiL-Bench directly addresses this shortcoming.
- Models achieving 75-89% pass@3 with complete information plummet to 4-24% when they must decide when to ask questions themselves. This is a problem of judgment, not capability.
- Introduction of a new metric, Ask-F1: the harmonic mean of question precision (avoiding useless questions) and blocker recall (finding all necessary information gaps). Its structure prevents gaming the metric by flooding the human with questions.
- Each model family exhibits unique failure patterns: GPT-based models confidently execute with false beliefs, Claude detects uncertainty but doesn't act, and Gemini responds to external signals but has large cross-domain variance.
- Blockers fall into three types: information missing from the specification (42%), ambiguous requests with multiple interpretations (36%), and contradictory information (22%). These are typical patterns of real-world production failures.
- Training Qwen3-32B with RLVR (Reinforcement Learning with Verifiable Rewards) demonstrates that the ability to judge when to ask for help is trainable. A model trained on SQL shows performance improvements on SWE—learning general uncertainty detection rather than domain-specific rules.
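The three blocker types above can be sketched as a small data model. This is an illustrative sketch only; the class names, IDs, and the example task are hypothetical, not taken from the HiL-Bench release.

```python
from dataclasses import dataclass
from enum import Enum

class BlockerType(Enum):
    MISSING_INFO = "missing information"         # 42% of blockers
    AMBIGUOUS = "ambiguous request"              # 36%
    CONTRADICTORY = "contradictory information"  # 22%

@dataclass(frozen=True)
class Blocker:
    blocker_id: str
    blocker_type: BlockerType
    description: str  # what the agent should notice during exploration
    resolution: str   # the answer ask_human() would return

# Hypothetical blocker for a text-to-SQL task
example = Blocker(
    blocker_id="sql-017-b1",
    blocker_type=BlockerType.AMBIGUOUS,
    description="'recent orders' could mean the last 7 or the last 30 days",
    resolution="Interpret 'recent' as the last 30 days",
)
print(example.blocker_type.value)  # ambiguous request
```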
Evidence
- GPT-5.3-Codex achieves 87% pass@3 on SQL with complete information, but drops to 5% when self-judging with the ask_human() tool. Ask-F1 is only 18.8%.
- Claude Opus 4.6 achieves the highest Ask-F1 (SQL 62.0%) among the models tested, but shows the largest cross-domain gap on SWE (28.2%). Still a -53pp difference compared to the complete information condition.
- RLVR-trained Qwen3-32B: SQL Ask-F1 improves from 18% to 46% (+28pp), and pass@3 improves from 4% to 21% (+17pp). Cross-domain transfer confirmed with SWE Ask-F1 also improving by +13pp, despite training only on SQL.
- The ask_human() judge model (Llama-3.3-70B-Instruct) achieves 97% precision and 91% recall, in high agreement with human judgment.
How to Apply
- When attaching a question tool like ask_human() to production agents, simply adding the tool is not enough. The model's ability to judge when to ask questions must be evaluated separately using Ask-F1. Current GPT/Claude/Gemini models all lack this judgment ability.
- When fine-tuning agents with RL, combining an asymmetric per-step reward (+0.3 for relevant blocker questions, -0.1 for irrelevant questions) with a terminal reward for overall blocker coverage can simultaneously improve question precision and recall.
- When building your own agent benchmarks, remove some information from the specification and embed blockers that are revealed only during exploration (progressive discovery). If every ambiguity is visible upfront, the condition is not realistic.
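A minimal sketch of the progressive-discovery idea, assuming a hypothetical task where the blocker only becomes visible once the agent inspects the schema (all names and the spec text are illustrative):

```python
# Spec deliberately omits what 'active' means.
SPEC = "Write a query counting active users per region."

# Schema is only visible after the agent explores the database.
SCHEMA = {"users": ["id", "region", "last_login", "deleted_at"]}

def visible_blockers(explored_tables: set[str]) -> list[str]:
    """Blockers revealed by what the agent has explored so far."""
    blockers = []
    if "users" in explored_tables:
        # The deleted_at column raises a question the spec never answers.
        blockers.append("Should soft-deleted users count as active?")
    return blockers

print(visible_blockers(set()))      # [] -- nothing looks ambiguous upfront
print(visible_blockers({"users"}))  # the blocker surfaces only after exploring
```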
Code Example
# HiL-Bench ask_human() tool and Ask-F1 calculation example
def ask_human(question: str) -> str:
    """
    Tool the agent calls when it encounters uncertain information.
    Llama-3.3-70B-Instruct performs a semantic judgment: it returns
    a resolution if the question matches a registered blocker,
    and 'irrelevant question' otherwise.
    """
    # An actual implementation needs a blocker registry and semantic matching.
    pass
def compute_ask_f1(questions: list[str], blockers: list[str], ask_human_fn) -> dict:
    """
    Ask-F1 calculation: harmonic mean of question precision and blocker recall.

    Args:
        questions: list of questions asked by the agent
        blockers: list of blockers registered for the task
        ask_human_fn: the ask_human tool function
    """
    relevant_questions = set()
    addressed_blockers = set()
    for q in questions:
        response = ask_human_fn(q)
        if response != 'irrelevant question':
            relevant_questions.add(q)
            # Track which blocker was resolved
            # (an actual implementation needs a blocker ID mapping)
            addressed_blockers.add(response)
    precision = len(relevant_questions) / len(questions) if questions else 0
    recall = len(addressed_blockers) / len(blockers) if blockers else 0
    ask_f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0)
    return {
        'precision': precision,
        'recall': recall,
        'ask_f1': ask_f1,
        'relevant_questions': len(relevant_questions),
        'addressed_blockers': len(addressed_blockers),
    }
# RLVR reward function design
def compute_reward(question: str, is_relevant: bool, is_duplicate: bool,
                   blockers_discovered: int, total_blockers: int,
                   is_terminal: bool) -> float:
    """Shaped reward from the HiL-Bench paper."""
    reward = 0.0
    # Per-step reward: encourage question precision
    if is_relevant and not is_duplicate:
        reward += 0.3  # hit a relevant blocker
    else:
        reward -= 0.1  # penalty for irrelevant or duplicate questions
    # Terminal reward: encourage blocker recall
    if is_terminal and blockers_discovered >= 1:
        reward += blockers_discovered / total_blockers
    return reward
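As a quick sanity check of the shaped reward, consider a hypothetical trajectory with three questions against a task containing two blockers: two questions hit relevant blockers, one is irrelevant, and both blockers end up covered.

```python
# Step rewards: relevant (+0.3), irrelevant (-0.1), relevant (+0.3),
# then the terminal bonus for covering 2 of 2 blockers (+1.0).
step_rewards = [0.3, -0.1, 0.3]
terminal_bonus = 2 / 2
total = round(sum(step_rewards) + terminal_bonus, 2)
print(total)  # 1.5
```

The asymmetry (+0.3 vs -0.1) means a question is worth asking even under moderate uncertainty, while the terminal bonus keeps the agent from stopping after its first hit.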
Original Abstract
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.