Invisible failures in human-AI interactions
TL;DR Highlight
78% of AI failures are invisible: something goes wrong, but the user never notices and simply acts on the output. This paper classifies these failures into eight archetypes.
Who Should Read
Product managers and engineers building AI-powered products who want to understand what kinds of invisible failures exist and how to catch them.
Core Mechanics
- Invisible failures cluster into eight archetypes: the confidence trap (wrong answers delivered with confidence), the silent mismatch (quietly misunderstanding user intent), the drift (going off topic), the death spiral (getting stuck in a repetition loop), the contradiction unravel (self-contradiction), the walkaway (the user abandons without achieving the goal), the partial recovery, and the mystery failure (unknown cause)
- 78% of AI failures are invisible: something went wrong, but the user gave no overt indication of a problem and may simply act on the flawed output
- The archetypes show systematic co-occurrence patterns that point to higher-level failure types
- 91% of invisible failures involve interactional dynamics rather than pure capability gaps, and an estimated 94% of those would persist even with a more capable model
- The taxonomy helps distinguish systematic from variable AI limitations across different usage domains, making it a building block for reliable failure monitoring
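The archetype taxonomy above can be made concrete in code. The following is a minimal sketch (the enum names and the signal-to-archetype mapping are illustrative assumptions, not taken from the paper's released code) of how low-level conversation signals might be mapped to the archetype they most suggest:

```python
# Sketch: the eight invisible-failure archetypes as an enum, plus a naive
# mapping from low-level conversation signals to a suggested archetype.
from enum import Enum

class Archetype(Enum):
    CONFIDENCE_TRAP = "the_confidence_trap"          # wrong answer, delivered confidently
    SILENT_MISMATCH = "the_silent_mismatch"          # quietly misread user intent
    DRIFT = "the_drift"                              # off-topic drift
    DEATH_SPIRAL = "the_death_spiral"                # repetition loop
    CONTRADICTION_UNRAVEL = "the_contradiction_unravel"
    WALKAWAY = "the_walkaway"                        # user abandons the goal
    PARTIAL_RECOVERY = "the_partial_recovery"
    MYSTERY_FAILURE = "the_mystery_failure"          # cause unknown

# Naive one-signal heuristics; a real tagger would weigh co-occurring signals.
SIGNAL_TO_ARCHETYPE = {
    "ai_false_confidence": Archetype.CONFIDENCE_TRAP,
    "ai_misunderstanding": Archetype.SILENT_MISMATCH,
    "ai_off_topic_drift": Archetype.DRIFT,
    "repetition": Archetype.DEATH_SPIRAL,
    "ai_self_contradiction": Archetype.CONTRADICTION_UNRAVEL,
    "user_abandonment": Archetype.WALKAWAY,
    "recovery": Archetype.PARTIAL_RECOVERY,
}

def suggest_archetype(signals: list[str]) -> Archetype:
    """Return the first matching archetype, else MYSTERY_FAILURE."""
    for signal in signals:
        if signal in SIGNAL_TO_ARCHETYPE:
            return SIGNAL_TO_ARCHETYPE[signal]
    return Archetype.MYSTERY_FAILURE
```

A mapping like this is only a fallback; the paper's methodology tags whole conversations with an LLM judge rather than single-signal rules.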
Evidence
- A large-scale quantitative analysis of human-AI conversations from the WildChat dataset found that 78% of AI failures went unnoticed by users
- Classifying failures as primarily interactional or capability-driven showed that 91% involve interactional dynamics, and an estimated 94% of those would persist even with a more capable model
- Archetype co-occurrence patterns were systematic, indicating higher-level failure types
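The co-occurrence finding above is straightforward to reproduce once conversations are tagged. A minimal sketch, assuming the input is one list of signal tags per analyzed conversation (the data format and tag names are assumptions, not the paper's):

```python
# Sketch: count how often pairs of signals co-occur across tagged
# conversations, to surface candidate higher-level failure types.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_conversations: list[list[str]]) -> Counter:
    """Count each unordered pair of signals that appears in the same conversation."""
    pairs = Counter()
    for signals in tagged_conversations:
        for a, b in combinations(sorted(set(signals)), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy example: three tagged conversations.
tagged = [
    ["ai_false_confidence", "goal_failure"],
    ["ai_false_confidence", "goal_failure", "ai_self_contradiction"],
    ["repetition", "user_abandonment"],
]
print(cooccurrence_counts(tagged).most_common(1))
# [(('ai_false_confidence', 'goal_failure'), 2)]
```

On real data, frequently co-occurring pairs (here, false confidence with goal failure) are the raw material for the higher-level failure types the paper describes.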
How to Apply
- Instrument your AI pipeline with logging at each step so invisible failures become observable: track latency, output-length anomalies, repetition, and topic drift.
- Build failure-detection probes for each of the eight archetypes, e.g. a secondary LLM judge that tags conversations with signals such as false confidence, self-contradiction, and user abandonment (see the Code Example below).
- Run regular red-team exercises targeting each archetype to understand your product's exposure before users encounter the failures.
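The first bullet above can be sketched as a thin wrapper around each model call. The thresholds, log fields, and helper names here are illustrative assumptions, not from the paper:

```python
# Sketch: instrument a model call with structured logging and cheap
# anomaly flags, so invisible failures leave a trace for later review.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai_pipeline")

def instrumented_call(model_fn, prompt: str, baseline_len: int = 400) -> dict:
    """Call the model, log latency and output length, and attach anomaly flags."""
    start = time.monotonic()
    output = model_fn(prompt)
    latency = time.monotonic() - start
    flags = []
    if len(output) < 0.2 * baseline_len:
        flags.append("output_too_short")   # possible refusal or truncation
    if len(output) > 5 * baseline_len:
        flags.append("output_too_long")    # possible drift or repetition loop
    record = {
        "latency_s": round(latency, 3),
        "output_chars": len(output),
        "flags": flags,
    }
    log.info(json.dumps(record))
    return record

# Usage with a stub model that returns a suspiciously short answer:
record = instrumented_call(lambda p: "Short.", "Summarize this report.")
```

Length and latency flags are only a first line of defense; they catch truncation, refusals, and runaway generations cheaply, while the LLM-judge probes handle the subtler archetypes.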
Code Example
# Example: automatic conversation quality tagging with Claude Sonnet (based on the paper's methodology)
import json
import anthropic

client = anthropic.Anthropic()

SIGNAL_TAGS = [
    "goal_failure", "ai_false_confidence", "ai_off_topic_drift",
    "ai_misunderstanding", "ai_self_contradiction", "user_abandonment",
    "repetition", "partial_success", "recovery",
]

def analyze_conversation(transcript: str) -> dict:
    prompt = f"""You are a human-AI conversation quality analyzer.
Analyze the conversation below and respond in JSON:
1. overall_quality: good / acceptable / poor / critical
2. signals: list of applicable signal tags (empty array if none)
   Possible tags: {SIGNAL_TAGS}
3. archetype: applicable invisible-failure pattern (null if none)
   - The confidence trap: delivering wrong answers with confidence
   - The silent mismatch: quietly misunderstanding user intent
   - The drift: drifting off topic
   - The death spiral: stuck in a repetition loop
   - The contradiction unravel: self-contradiction
   - The walkaway: user abandons without achieving the goal
   - The partial recovery: failure followed by partial recovery
   - The mystery failure: failure with unknown cause
4. evidence: basis for the judgment (specific turn, quoted text)
Conversation:
{transcript}
Output JSON only:"""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
# Usage example
transcript = """
User: Which city had the most gun homicides in the US in 2019?
AI: It's Baltimore. There were 348 homicides.
User: How many were there in Chicago?
AI: Chicago had 492.
User: So which city had more?
AI: Chicago had more.
"""
result = analyze_conversation(transcript)
print(result)
# Example output (actual model output varies from run to run):
# {
# "overall_quality": "poor",
# "signals": ["goal_failure", "ai_self_contradiction", "ai_false_confidence"],
# "archetype": "The confidence trap",
# "evidence": "AI initially stated Baltimore was #1, but Chicago actually had more homicides"
# }
Original Abstract
AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes