Invisible failures in human-AI interactions
TL;DR Highlight
78% of AI failures are invisible: something goes wrong, but the user never notices and simply acts on the output. This paper classifies these failures into eight archetypes.
Who Should Read
Product managers and engineers building AI-powered products who want to understand what kinds of invisible failures exist and how to catch them.
Core Mechanics
- Invisible failures cluster into eight archetypes: the confidence trap (wrong answers delivered with confidence), the silent mismatch (quietly misunderstanding user intent), the drift (going off topic), the death spiral (getting stuck in a repetition loop), the contradiction unravel (self-contradiction), the walkaway (the user abandons without achieving the goal), the partial recovery, and the mystery failure (unknown cause)
- 78% of AI failures are invisible: something went wrong, but the user gave no overt indication of a problem and may simply act on the flawed output
- The archetypes show systematic co-occurrence patterns that point to higher-level failure types
- 91% of invisible failures involve interactional dynamics rather than pure capability gaps, and an estimated 94% of those would persist even with a more capable model
- The taxonomy helps distinguish systematic from variable AI limitations across different usage domains, making it a building block for reliable failure monitoring
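The archetype taxonomy above can be made concrete in code. The following is a minimal sketch (the enum names and the signal-to-archetype mapping are illustrative assumptions, not taken from the paper's released code) of how low-level conversation signals might be mapped to the archetype they most suggest:

```python
# Sketch: the eight invisible-failure archetypes as an enum, plus a naive
# mapping from low-level conversation signals to a suggested archetype.
from enum import Enum

class Archetype(Enum):
    CONFIDENCE_TRAP = "the_confidence_trap"          # wrong answer, delivered confidently
    SILENT_MISMATCH = "the_silent_mismatch"          # quietly misread user intent
    DRIFT = "the_drift"                              # off-topic drift
    DEATH_SPIRAL = "the_death_spiral"                # repetition loop
    CONTRADICTION_UNRAVEL = "the_contradiction_unravel"
    WALKAWAY = "the_walkaway"                        # user abandons the goal
    PARTIAL_RECOVERY = "the_partial_recovery"
    MYSTERY_FAILURE = "the_mystery_failure"          # cause unknown

# Naive one-signal heuristics; a real tagger would weigh co-occurring signals.
SIGNAL_TO_ARCHETYPE = {
    "ai_false_confidence": Archetype.CONFIDENCE_TRAP,
    "ai_misunderstanding": Archetype.SILENT_MISMATCH,
    "ai_off_topic_drift": Archetype.DRIFT,
    "repetition": Archetype.DEATH_SPIRAL,
    "ai_self_contradiction": Archetype.CONTRADICTION_UNRAVEL,
    "user_abandonment": Archetype.WALKAWAY,
    "recovery": Archetype.PARTIAL_RECOVERY,
}

def suggest_archetype(signals: list[str]) -> Archetype:
    """Return the first matching archetype, else MYSTERY_FAILURE."""
    for signal in signals:
        if signal in SIGNAL_TO_ARCHETYPE:
            return SIGNAL_TO_ARCHETYPE[signal]
    return Archetype.MYSTERY_FAILURE
```

A mapping like this is only a fallback; the paper's methodology tags whole conversations with an LLM judge rather than single-signal rules.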
Evidence
- A large-scale quantitative analysis of human-AI conversations from the WildChat dataset found that 78% of AI failures went unnoticed by users
- Classifying failures as primarily interactional or capability-driven showed that 91% involve interactional dynamics, and an estimated 94% of those would persist even with a more capable model
- Archetype co-occurrence patterns were systematic, indicating higher-level failure types
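The co-occurrence finding above is straightforward to reproduce once conversations are tagged. A minimal sketch, assuming the input is one list of signal tags per analyzed conversation (the data format and tag names are assumptions, not the paper's):

```python
# Sketch: count how often pairs of signals co-occur across tagged
# conversations, to surface candidate higher-level failure types.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_conversations: list[list[str]]) -> Counter:
    """Count each unordered pair of signals that appears in the same conversation."""
    pairs = Counter()
    for signals in tagged_conversations:
        for a, b in combinations(sorted(set(signals)), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy example: three tagged conversations.
tagged = [
    ["ai_false_confidence", "goal_failure"],
    ["ai_false_confidence", "goal_failure", "ai_self_contradiction"],
    ["repetition", "user_abandonment"],
]
print(cooccurrence_counts(tagged).most_common(1))
# [(('ai_false_confidence', 'goal_failure'), 2)]
```

On real data, frequently co-occurring pairs (here, false confidence with goal failure) are the raw material for the higher-level failure types the paper describes.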
How to Apply
- Instrument your AI pipeline with logging at each step so invisible failures become observable: track latency, output-length anomalies, repetition, and topic drift.
- Build failure-detection probes for each of the eight archetypes, e.g. a secondary LLM judge that tags conversations with signals such as false confidence, self-contradiction, and user abandonment (see the Code Example below).
- Run regular red-team exercises targeting each archetype to understand your product's exposure before users encounter the failures.
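The first bullet above can be sketched as a thin wrapper around each model call. The thresholds, log fields, and helper names here are illustrative assumptions, not from the paper:

```python
# Sketch: instrument a model call with structured logging and cheap
# anomaly flags, so invisible failures leave a trace for later review.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai_pipeline")

def instrumented_call(model_fn, prompt: str, baseline_len: int = 400) -> dict:
    """Call the model, log latency and output length, and attach anomaly flags."""
    start = time.monotonic()
    output = model_fn(prompt)
    latency = time.monotonic() - start
    flags = []
    if len(output) < 0.2 * baseline_len:
        flags.append("output_too_short")   # possible refusal or truncation
    if len(output) > 5 * baseline_len:
        flags.append("output_too_long")    # possible drift or repetition loop
    record = {
        "latency_s": round(latency, 3),
        "output_chars": len(output),
        "flags": flags,
    }
    log.info(json.dumps(record))
    return record

# Usage with a stub model that returns a suspiciously short answer:
record = instrumented_call(lambda p: "Short.", "Summarize this report.")
```

Length and latency flags are only a first line of defense; they catch truncation, refusals, and runaway generations cheaply, while the LLM-judge probes handle the subtler archetypes.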
Code Example
# Example: automatic conversation quality tagging with Claude Sonnet (based on the paper's methodology)
import json
import anthropic

client = anthropic.Anthropic()

SIGNAL_TAGS = [
    "goal_failure", "ai_false_confidence", "ai_off_topic_drift",
    "ai_misunderstanding", "ai_self_contradiction", "user_abandonment",
    "repetition", "partial_success", "recovery",
]

def analyze_conversation(transcript: str) -> dict:
    prompt = f"""You are a human-AI conversation quality analyzer.
Analyze the conversation below and respond in JSON:
1. overall_quality: good / acceptable / poor / critical
2. signals: list of applicable signal tags (empty array if none)
   Possible tags: {SIGNAL_TAGS}
3. archetype: applicable invisible-failure pattern (null if none)
   - The confidence trap: delivering wrong answers with confidence
   - The silent mismatch: quietly misunderstanding user intent
   - The drift: drifting off topic
   - The death spiral: stuck in a repetition loop
   - The contradiction unravel: self-contradiction
   - The walkaway: user abandons without achieving the goal
   - The partial recovery: failure followed by partial recovery
   - The mystery failure: failure with unknown cause
4. evidence: basis for the judgment (specific turn, quoted text)
Conversation:
{transcript}
Output JSON only:"""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
# Usage example
transcript = """
User: Which city had the most gun homicides in the US in 2019?
AI: It's Baltimore. There were 348 homicides.
User: How many were there in Chicago?
AI: Chicago had 492.
User: So which city had more?
AI: Chicago had more.
"""
result = analyze_conversation(transcript)
print(result)
# Example output (actual model output varies from run to run):
# {
# "overall_quality": "poor",
# "signals": ["goal_failure", "ai_self_contradiction", "ai_false_confidence"],
# "archetype": "The confidence trap",
# "evidence": "AI initially stated Baltimore was #1, but Chicago actually had more homicides"
# }
Original Abstract
AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes