CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
TL;DR Highlight
A benchmark evaluating whether LLM agents can say 'I don't know' and resolve ambiguous requests on their own, using an in-vehicle voice assistant scenario
Who Should Read
Backend/AI engineers deploying LLM agents in real services, especially developers concerned about chatbot and tool-calling agent reliability and hallucination issues.
Core Mechanics
- Existing benchmarks count a task as passed if the agent succeeds even once, but CAR-bench requires 3 consecutive successes (Pass^3) to measure actual deployment reliability
- Even GPT-5 drops from Pass@3 68% to Pass^3 36% on Disambiguation tasks — occasionally succeeding is very different from consistently succeeding
- Hallucination task: when needed tools/parameters/results are unavailable, non-reasoning models (GPT-4.1 etc.) respond falsely roughly 40% of the time, claiming that nonexistent capabilities exist
- Disambiguation task: when there are multiple options like 'directions to Paris,' ~90% of GPT-5's failures are premature actions — executing immediately without gathering information first
- Thinking models (using reasoning tokens) are generally superior to non-thinking models, with the gap widening as task complexity increases
- Claude-Opus-4.5 is strong on Base/Disambiguation, GPT-5 strong on Hallucination — different weaknesses despite similar overall performance
Evidence
- Best model GPT-5 (thinking) averages only 54% Pass^3 — fails to succeed 3 consecutive times in nearly half of scenarios
- Disambiguation Pass^3: all models under 50%, GPT-5 at 36%, open-source Qwen3-32B at 22%
- Hallucination Pass^3: Non-thinking GPT-4.1 39%, Thinking GPT-5 60% — reasoning ability partially helps suppress hallucination
- GPT-5 latency 22.7s/call vs Claude-Sonnet-4 5.3s/call vs Gemini-2.5-Flash 1.1s/call — stark performance-speed-cost tradeoff
How to Apply
- For an agent 'hallucination defense layer': add rules to the system prompt stating that when a tool result is 'unknown' or a required parameter is missing, the agent must explicitly respond that it 'cannot perform' the action
- For multi-turn agent ambiguous request handling: before executing, follow a priority disambiguation policy in the prompt — (1) check policies → (2) query user preferences → (3) collect context → (4) if still ambiguous, ask the user — to reduce premature actions
- For agent reliability evaluation: don't just measure Pass@k (succeeds at least once) but also Pass^k (k consecutive successes) — the larger the gap between these metrics, the higher the deployment risk
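The gap measurement in the last bullet can be sketched as follows. This is a minimal aggregation over a whole benchmark; the `scenario_trials` data is hypothetical, not from CAR-bench:

```python
def pass_metrics(trials_per_scenario: list[list[bool]]) -> dict[str, float]:
    """Aggregate Pass@k (any success) and Pass^k (all k successes) over scenarios."""
    n = len(trials_per_scenario)
    pass_at_k = sum(any(t) for t in trials_per_scenario) / n
    pass_hat_k = sum(all(t) for t in trials_per_scenario) / n
    # A large gap means the agent *can* do the tasks but not *reliably*.
    return {"pass@k": pass_at_k, "pass^k": pass_hat_k, "gap": pass_at_k - pass_hat_k}

# Hypothetical results: 4 scenarios, k=3 trials each
scenario_trials = [
    [True, True, True],     # consistent success
    [True, False, True],    # flaky
    [False, True, False],   # flaky
    [False, False, False],  # consistent failure
]
metrics = pass_metrics(scenario_trials)
print(metrics)  # {'pass@k': 0.75, 'pass^k': 0.25, 'gap': 0.5}
```

Here Pass@3 looks healthy (75%) while Pass^3 is only 25% — exactly the kind of gap the benchmark flags as deployment risk.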
Code Example
# Disambiguation policy prompt example (CAR-bench style)
system_prompt = """
## Ambiguity Resolution Policy
If a user request has multiple options, resolve using the following priority order:
1. Explicit policy rules (e.g., sunroof can only be opened if sunshade is open)
2. User's explicit request
3. Personal preferences retrieved via get_user_preferences()
4. Defaults/heuristics (e.g., shortest route if no route specified)
5. Context information (current state, location, etc. collected via get tools)
6. Ask the user only if none of the above can resolve the ambiguity
## Limitation Awareness Policy
- If a required tool is unavailable or tool result is incomplete, must explicitly notify the user
- 'Pretending to perform' or 'ignoring and proceeding' is strictly prohibited
- e.g., "I'm sorry, I am currently unable to perform that function (X)"
"""
# Pass^k measurement code example
# (run_agent_task is a hypothetical harness function returning True on success)
def pass_hat_k(results: list[bool]) -> bool:
    """Pass^k: True only if all k attempts succeed (consistency metric)"""
    return all(results)

def pass_at_k(results: list[bool]) -> bool:
    """Pass@k: True if at least one attempt succeeds (potential metric)"""
    return any(results)

# Run the agent k=3 times on the same task and measure both metrics
k = 3
trial_results = [run_agent_task(task) for _ in range(k)]
consistency = pass_hat_k(trial_results)  # deployment reliability
potential = pass_at_k(trial_results)     # maximum capability
print(f"Pass^{k}: {consistency}, Pass@{k}: {potential}")
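To complement the prompt-level Limitation Awareness Policy, a runtime guard can refuse calls to tools the deployment does not actually expose, instead of letting the model fabricate a result. The tool names, registry, and messages below are illustrative assumptions, not part of CAR-bench:

```python
# Hypothetical tool registry: only these capabilities actually exist
AVAILABLE_TOOLS = {
    "navigate_to": lambda destination: f"Navigating to {destination}",
    "get_user_preferences": lambda: {"route": "shortest"},
}

def dispatch_tool(name: str, *args, **kwargs) -> dict:
    """Return an explicit refusal for unavailable tools or bad arguments."""
    tool = AVAILABLE_TOOLS.get(name)
    if tool is None:
        # Hallucination defense: never pretend an unavailable capability exists
        return {"ok": False,
                "error": f"Tool '{name}' is unavailable; tell the user this action cannot be performed."}
    try:
        return {"ok": True, "result": tool(*args, **kwargs)}
    except TypeError as exc:
        # Missing/wrong parameters are surfaced, not silently defaulted
        return {"ok": False, "error": f"Invalid arguments for '{name}': {exc}"}

print(dispatch_tool("open_sunroof"))                  # unavailable tool -> explicit refusal
print(dispatch_tool("navigate_to", "Paris, France"))  # available tool -> normal result
```

Feeding the structured error back to the model as the tool result gives it something concrete to relay to the user, rather than an open invitation to improvise.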
Original Abstract
Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.