CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
TL;DR Highlight
A benchmark evaluating whether LLM agents can say 'I don't know' and resolve ambiguous requests on their own, using an in-vehicle voice assistant scenario
Who Should Read
Backend/AI engineers deploying LLM agents in real services, especially developers concerned about chatbot and tool-calling agent reliability and hallucination issues.
Core Mechanics
- Existing benchmarks count a task as passed if the agent succeeds even once, but CAR-bench requires 3 consecutive successes (Pass^3) to measure actual deployment reliability
- Even GPT-5 drops from Pass@3 68% to Pass^3 36% on Disambiguation tasks — occasionally succeeding is very different from consistently succeeding
- Hallucination task: when needed tools/parameters/results are unavailable, non-reasoning models (GPT-4.1 etc.) respond falsely roughly 40% of the time, claiming that nonexistent capabilities exist
- Disambiguation task: when there are multiple options like 'directions to Paris,' ~90% of GPT-5's failures are premature actions — executing immediately without gathering information first
- Thinking models (using reasoning tokens) are generally superior to non-thinking models, with the gap widening as task complexity increases
- Claude-Opus-4.5 is strong on Base/Disambiguation, GPT-5 strong on Hallucination — different weaknesses despite similar overall performance
Evidence
- Best model GPT-5 (thinking) averages only 54% Pass^3 — fails to succeed 3 consecutive times in nearly half of scenarios
- Disambiguation Pass^3: all models under 50%, GPT-5 at 36%, open-source Qwen3-32B at 22%
- Hallucination Pass^3: Non-thinking GPT-4.1 39%, Thinking GPT-5 60% — reasoning ability partially helps suppress hallucination
- GPT-5 latency 22.7s/call vs Claude-Sonnet-4 5.3s/call vs Gemini-2.5-Flash 1.1s/call — stark performance-speed-cost tradeoff
How to Apply
- For an agent 'hallucination defense layer': add rules to the system prompt stating that when a tool result is 'unknown' or a required parameter is missing, the agent must explicitly respond that it 'cannot perform' the action
- For multi-turn agent ambiguous request handling: before executing, follow a priority disambiguation policy in the prompt — (1) check policies → (2) query user preferences → (3) collect context → (4) if still ambiguous, ask the user — to reduce premature actions
- For agent reliability evaluation: don't just measure Pass@k (succeeds at least once) but also Pass^k (k consecutive successes) — the larger the gap between these metrics, the higher the deployment risk
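The gap measurement in the last bullet can be sketched as follows. This is a minimal aggregation over a whole benchmark; the `scenario_trials` data is hypothetical, not from CAR-bench:

```python
def pass_metrics(trials_per_scenario: list[list[bool]]) -> dict[str, float]:
    """Aggregate Pass@k (any success) and Pass^k (all k successes) over scenarios."""
    n = len(trials_per_scenario)
    pass_at_k = sum(any(t) for t in trials_per_scenario) / n
    pass_hat_k = sum(all(t) for t in trials_per_scenario) / n
    # A large gap means the agent *can* do the tasks but not *reliably*.
    return {"pass@k": pass_at_k, "pass^k": pass_hat_k, "gap": pass_at_k - pass_hat_k}

# Hypothetical results: 4 scenarios, k=3 trials each
scenario_trials = [
    [True, True, True],     # consistent success
    [True, False, True],    # flaky
    [False, True, False],   # flaky
    [False, False, False],  # consistent failure
]
metrics = pass_metrics(scenario_trials)
print(metrics)  # {'pass@k': 0.75, 'pass^k': 0.25, 'gap': 0.5}
```

Here Pass@3 looks healthy (75%) while Pass^3 is only 25% — exactly the kind of gap the benchmark flags as deployment risk.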
Code Example
# Disambiguation policy prompt example (CAR-bench style)
system_prompt = """
## Ambiguity Resolution Policy
If a user request has multiple options, resolve using the following priority order:
1. Explicit policy rules (e.g., sunroof can only be opened if sunshade is open)
2. User's explicit request
3. Personal preferences retrieved via get_user_preferences()
4. Defaults/heuristics (e.g., shortest route if no route specified)
5. Context information (current state, location, etc. collected via get tools)
6. Ask the user only if none of the above can resolve the ambiguity
## Limitation Awareness Policy
- If a required tool is unavailable or tool result is incomplete, must explicitly notify the user
- 'Pretending to perform' or 'ignoring and proceeding' is strictly prohibited
- e.g., "I'm sorry, I am currently unable to perform that function (X)"
"""
# Pass^k measurement code example
# (run_agent_task is a hypothetical harness function returning True on success)
def pass_hat_k(results: list[bool]) -> bool:
    """Pass^k: True only if all k attempts succeed (consistency metric)"""
    return all(results)

def pass_at_k(results: list[bool]) -> bool:
    """Pass@k: True if at least one attempt succeeds (potential metric)"""
    return any(results)

# Run the agent k=3 times on the same task and measure both metrics
k = 3
trial_results = [run_agent_task(task) for _ in range(k)]
consistency = pass_hat_k(trial_results)  # deployment reliability
potential = pass_at_k(trial_results)     # maximum capability
print(f"Pass^{k}: {consistency}, Pass@{k}: {potential}")
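To complement the prompt-level Limitation Awareness Policy, a runtime guard can refuse calls to tools the deployment does not actually expose, instead of letting the model fabricate a result. The tool names, registry, and messages below are illustrative assumptions, not part of CAR-bench:

```python
# Hypothetical tool registry: only these capabilities actually exist
AVAILABLE_TOOLS = {
    "navigate_to": lambda destination: f"Navigating to {destination}",
    "get_user_preferences": lambda: {"route": "shortest"},
}

def dispatch_tool(name: str, *args, **kwargs) -> dict:
    """Return an explicit refusal for unavailable tools or bad arguments."""
    tool = AVAILABLE_TOOLS.get(name)
    if tool is None:
        # Hallucination defense: never pretend an unavailable capability exists
        return {"ok": False,
                "error": f"Tool '{name}' is unavailable; tell the user this action cannot be performed."}
    try:
        return {"ok": True, "result": tool(*args, **kwargs)}
    except TypeError as exc:
        # Missing/wrong parameters are surfaced, not silently defaulted
        return {"ok": False, "error": f"Invalid arguments for '{name}': {exc}"}

print(dispatch_tool("open_sunroof"))                  # unavailable tool -> explicit refusal
print(dispatch_tool("navigate_to", "Paris, France"))  # available tool -> normal result
```

Feeding the structured error back to the model as the tool result gives it something concrete to relay to the user, rather than an open invitation to improvise.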
Original Abstract
Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.