Invisible Failures in Human-AI Interactions

Mar 16, 2026 • Christopher Potts, Moritz Sudhof

TL;DR Highlight

78% of AI failures happen silently, with the user giving no overt sign that anything went wrong; these invisible failures fall into eight patterns.

Who Should Read

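Product developers and ML engineers who run AI chatbots or LLM-based products and want to build a quality-monitoring system, especially teams that get too little user feedback to detect failures reliably.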

Core Mechanics

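78% of AI failures are "invisible failures" in which the user never voices dissatisfaction. Conventional monitoring (negative feedback, explicit correction requests) catches only the remaining 22%.
Invisible failures fall into eight patterns: Confidence Trap (a wrong answer delivered confidently), Silent Mismatch (the user's goal is quietly missed), The Drift (straying off topic), Death Spiral (a repetitive loop), Contradiction Unravel (self-contradiction), Walkaway (the user gives up and leaves), Partial Recovery (the problem is only partly fixed), and Mystery Failure (cause unknown).
The most common patterns are The Drift (37.2%) and Confidence Trap (26.4%); both occur more often than visible (explicit) failures.
91% of failures involve how the interaction unfolds (for example, generating an answer to an ambiguous request) rather than a pure lack of model capability, and an estimated 94% would persist even after switching to a more capable model.
One behavior shows up in 79% of failures: generating rather than clarifying. Even when the request is ambiguous, the model simply produces a plausible-sounding answer.
Patterns vary by domain: software development shows a high rate of visible failure, while creative writing and UX design are especially prone to The Drift.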

Evidence

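In an analysis of 196,704 English WildChat conversations, 78% (23,058) of the 29,640 conversations tagged goal_failure were invisible failures, occurring with no explicit reaction from the user.
When Claude Opus was used to validate 1,044 failed conversations, 91% involved interactional dynamics (58% primarily interactional, 33% mixed); only 7% were primarily due to a lack of capability.
94% of failures were judged likely to persist even with a more capable model (definitely yes 4% + probably yes 90%).
Accuracy check of the AI quality-signal tagger: the failure judgment was confirmed or rated plausible in 95% of cases (93% confirmed + 2% plausible).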
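
The headline percentages follow directly from the counts reported above. A quick arithmetic check, using only those figures (the variable names are ours):

```python
# Counts as reported in the Evidence section above.
total_conversations = 196_704   # English WildChat conversations analyzed
goal_failures = 29_640          # conversations tagged goal_failure
invisible_failures = 23_058     # failures with no explicit user reaction

invisible_share = invisible_failures / goal_failures
visible_share = 1 - invisible_share
failure_rate = goal_failures / total_conversations

print(f"invisible share of failures: {invisible_share:.1%}")
print(f"visible share of failures:   {visible_share:.1%}")
print(f"goal_failure rate overall:   {failure_rate:.1%}")
```

The invisible share comes out just under 78%, which matches the rounded figure quoted throughout; the overall goal_failure rate is a derived number, not one stated in the summary.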

How to Apply

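Augment existing star-rating or negative-feedback monitoring with automatic LLM-as-judge tagging using the 28 signal tags (goal_failure, ai_false_confidence, ai_off_topic_drift, and others) to help surface the roughly 78% of failures that feedback-based monitoring misses.
Because failure patterns differ by domain, prioritize The Drift monitoring when deploying in creative or UX domains and Confidence Trap detection for knowledge-based services.
Add user_abandonment detection to the pipeline so that a conversation the user ends with the goal unmet and without an explicit correction (The Walkaway pattern) is treated as a failure signal.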
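
One way to wire these steps together is sketched below. It assumes each conversation has already been tagged by an LLM judge and stored as a record with `domain`, `signals`, and `archetype` fields. That record shape, the tag names in `VISIBLE_SIGNALS`, and both helper functions are hypothetical illustrations, not the paper's schema.

```python
from collections import Counter, defaultdict

# Hypothetical tags meaning "the user visibly pushed back"; the paper's
# actual tag names for overt user reactions may differ.
VISIBLE_SIGNALS = {"user_negative_feedback", "user_explicit_correction"}


def is_invisible_failure(record: dict) -> bool:
    """True for a failed goal with no overt user reaction in the conversation."""
    signals = set(record["signals"])
    return "goal_failure" in signals and not (signals & VISIBLE_SIGNALS)


def archetypes_by_domain(records: list[dict]) -> dict[str, Counter]:
    """Count invisible-failure archetypes per domain to decide what to monitor first."""
    counts = defaultdict(Counter)
    for record in records:
        if is_invisible_failure(record):
            counts[record["domain"]][record["archetype"] or "unclassified"] += 1
    return dict(counts)


# Toy records, invented for illustration only.
records = [
    {"domain": "creative_writing", "signals": ["goal_failure", "ai_off_topic_drift"], "archetype": "The drift"},
    {"domain": "creative_writing", "signals": ["goal_failure", "user_abandonment"], "archetype": "The walkaway"},
    {"domain": "knowledge", "signals": ["goal_failure", "ai_false_confidence"], "archetype": "The confidence trap"},
    {"domain": "software", "signals": ["goal_failure", "user_explicit_correction"], "archetype": None},
    {"domain": "software", "signals": [], "archetype": None},
]

summary = archetypes_by_domain(records)
for domain, counter in summary.items():
    print(domain, counter.most_common(1))
```

The software record with an explicit correction is a visible failure, so it is excluded from the summary; only failures that feedback-based monitoring would miss are counted.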

Code Example

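# Example: automatic conversation-quality tagging with Claude Sonnet (based on the paper's method)

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative subset; the paper's tagger uses 28 signal tags.
SIGNAL_TAGS = [
    "goal_failure", "ai_false_confidence", "ai_off_topic_drift",
    "ai_misunderstanding", "ai_self_contradiction", "user_abandonment",
    "repetition", "partial_success", "recovery"
]

def analyze_conversation(transcript: str) -> dict:
    """Ask the model to tag one transcript and return its parsed JSON verdict."""
    prompt = f"""You are a human-AI conversation quality analyzer.
Analyze the conversation below and respond in JSON:

1. overall_quality: good / acceptable / poor / critical
2. signals: list of applicable signal tags (empty array if none)
   Possible tags: {SIGNAL_TAGS}
3. archetype: the applicable invisible failure pattern (null if none)
   - The confidence trap: delivers a wrong answer with confidence
   - The silent mismatch: quietly misunderstands the user's intent
   - The drift: strays from the topic
   - The death spiral: gets stuck in a repetitive loop
   - The contradiction unravel: contradicts itself
   - The walkaway: the user leaves with the goal unmet
   - The partial recovery: recovers only partially
   - The mystery failure: failure with no identifiable cause
4. evidence: basis for the judgment (specific turns, text)

Conversation:
{transcript}

Output JSON only:"""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    # Models sometimes wrap JSON in a Markdown code fence; strip it before parsing.
    text = response.content[0].text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)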

# Usage example
transcript = """
User: Which US city had the most gun homicides in 2019?
AI: Baltimore, with 348 homicides.
User: How many did Chicago have?
AI: Chicago had 492.
User: So which city had more?
AI: Chicago had more.
"""

result = analyze_conversation(transcript)
print(result)
# Illustrative output (actual model output will vary):
# {
#   "overall_quality": "poor",
#   "signals": ["goal_failure", "ai_self_contradiction", "ai_false_confidence"],
#   "archetype": "The confidence trap",
#   "evidence": "The AI first asserted that Baltimore ranked first, but Chicago actually had more"
# }
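
Judge output is free-form text, so it is worth validating before it reaches a dashboard. The sketch below checks a parsed result against the allowed quality levels, tags, and archetypes. `validate_result` and the constant names are hypothetical helpers, and the allowed values simply mirror the lists used in the example above.

```python
ALLOWED_QUALITY = {"good", "acceptable", "poor", "critical"}
ALLOWED_SIGNALS = {
    "goal_failure", "ai_false_confidence", "ai_off_topic_drift",
    "ai_misunderstanding", "ai_self_contradiction", "user_abandonment",
    "repetition", "partial_success", "recovery",
}
ALLOWED_ARCHETYPES = {
    "the confidence trap", "the silent mismatch", "the drift",
    "the death spiral", "the contradiction unravel", "the walkaway",
    "the partial recovery", "the mystery failure",
}


def validate_result(result: dict) -> list[str]:
    """Return a list of problems found in a judge result (empty means valid)."""
    problems = []
    if result.get("overall_quality") not in ALLOWED_QUALITY:
        problems.append(f"unknown overall_quality: {result.get('overall_quality')!r}")
    unknown = set(result.get("signals", [])) - ALLOWED_SIGNALS
    if unknown:
        problems.append(f"unknown signals: {sorted(unknown)}")
    archetype = result.get("archetype")
    # null is allowed; anything else must be one of the eight names (case-insensitive).
    if archetype is not None and (
        not isinstance(archetype, str) or archetype.lower() not in ALLOWED_ARCHETYPES
    ):
        problems.append(f"unknown archetype: {archetype!r}")
    return problems


ok = {"overall_quality": "poor", "signals": ["goal_failure"], "archetype": "The confidence trap"}
bad = {"overall_quality": "meh", "signals": ["made_up_tag"], "archetype": "The glitch"}
print(validate_result(ok))
print(validate_result(bad))
```

Returning a list of problems rather than raising lets a pipeline log malformed verdicts and decide separately whether to retry the judge call or drop the record.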

Terminology

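Invisible Failure: A situation where the AI fails but the user gives no overt sign of a problem, for example when it gives a wrong answer confidently or quietly misses the user's goal. Because the exchange looks successful on the surface, conventional monitoring does not catch it.

goal_failure: A tag indicating that the AI did not achieve what the user wanted, for example because the format was wrong, the content was entirely different from what was asked, or constraints were ignored.

PPMI: Positive Pointwise Mutual Information, a statistic that measures how strongly two items co-occur. Higher values mean two failure patterns tend to appear together more strongly. It looks not at raw frequency but at how much more often the items co-occur than would be expected by chance.

LLM-as-judge: A technique in which an LLM (large language model) evaluates the quality of another AI's responses. A more capable AI model acts as the judge in place of a human.

The Confidence Trap: A failure pattern in which the AI delivers incorrect information with complete confidence and no hint of doubt. Because the answer sounds confident, the user simply believes it.

The Drift: A pattern in which the AI gradually moves away from the user's actual intent over the course of the conversation. The output looks close at first but ends up entirely different from what was wanted.

generate rather than clarify: The AI's default tendency, when given an ambiguous request, to produce a plausible-sounding answer instead of asking for clarification. A key cause, found in 79% of failures.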
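
To make the PPMI entry above concrete: positive pointwise mutual information compares how often two items co-occur with how often they would co-occur if they were independent, and clips negative values to zero, i.e. PPMI(x, y) = max(0, log2(P(x, y) / (P(x) P(y)))). A minimal sketch over toy data follows. The `ppmi` helper and the sample label sets are invented for illustration and are not the paper's data or code.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi(observations: list[set[str]]) -> dict[tuple[str, str], float]:
    """PPMI for every pair of labels that co-occurs at least once."""
    n = len(observations)
    single = Counter()
    pair = Counter()
    for labels in observations:
        single.update(labels)
        # Sort so each unordered pair is always counted under the same key.
        pair.update(combinations(sorted(labels), 2))
    scores = {}
    for (x, y), joint in pair.items():
        pmi = math.log2((joint / n) / ((single[x] / n) * (single[y] / n)))
        scores[(x, y)] = max(0.0, pmi)
    return scores


# Toy per-conversation label sets (illustrative only).
observations = [
    {"drift", "confidence_trap"},
    {"drift", "confidence_trap"},
    {"drift"},
    {"walkaway"},
]
scores = ppmi(observations)
print(scores)
```

In this toy data the two labels co-occur in 2 of 4 conversations, against an expected rate of (3/4)(2/4) under independence, so the pair scores log2(4/3), a small positive value.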

Original Abstract

AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes