A paradox of AI fluency
TL;DR Highlight
Expert AI users experience more failures, but those failures are visible and recoverable; novices' failures are often invisible, ending conversations that look successful but miss the mark.
Who Should Read
Product developers building chatbot-based AI products, UX engineers designing their interactions, and anyone seeking to understand user behavior patterns and failure modes in AI services in order to improve their products.
Core Mechanics
- Analysis of 27K conversations from WildChat-4.8M categorizes AI fluency into four levels (minimal/low/moderate/high); the proportion of high-fluency users remains consistently low, and new-user growth occurs primarily at the low/minimal levels.
- 93% of high-fluency users employ an augmentative style – treating AI as a collaborative tool, refining goals, critically reviewing outputs, and iteratively revising. Conversely, 87% of novice users adopt a delegative style, simply accepting AI results.
- Paradoxically, 64% of conversations from skilled users exhibit failure signals, compared to 24% from novices. However, this doesn't mean novices are more successful.
- 59% of failures among skilled users are 'visible failures' – those they recognize and can address. In contrast, 85.6% of novice failures are 'invisible failures' – conversations that appear successful but ultimately fail to achieve the user's goals.
- Skilled users attempt more complex tasks (average task complexity of 3.1 on a 5-point scale) compared to novices (1.5), yet achieve higher success rates with these more challenging tasks.
- Regression modeling confirms fluency is a significant predictor of both success rate (p<0.01) and failure visibility (p<0.001). While increased conversation turns and complexity decrease success rates, fluency mitigates this effect.
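The aggregate statistics above can be sketched as a simple tallying pass over annotated conversations. The record fields below (`fluency`, `style`, `failed`, `failure_visible`) and the sample data are illustrative assumptions, not the paper's actual annotation schema:

```python
from collections import defaultdict

# Hypothetical annotated records: (fluency, style, failed, failure_visible)
records = [
    ("high", "augmentative", True, True),
    ("high", "augmentative", False, None),
    ("minimal", "delegative", True, False),
    ("minimal", "delegative", False, None),
    ("minimal", "delegative", True, False),
]

def summarize(records):
    """Aggregate style, failure, and failure-visibility rates per fluency level."""
    stats = defaultdict(lambda: {"n": 0, "augmentative": 0, "failed": 0, "visible": 0})
    for fluency, style, failed, visible in records:
        s = stats[fluency]
        s["n"] += 1
        s["augmentative"] += style == "augmentative"
        if failed:
            s["failed"] += 1
            s["visible"] += bool(visible)
    return {
        level: {
            "augmentative_rate": s["augmentative"] / s["n"],
            "failure_rate": s["failed"] / s["n"],
            # share of failures the user actually noticed; the rest are "invisible"
            "visible_failure_share": s["visible"] / s["failed"] if s["failed"] else None,
        }
        for level, s in stats.items()
    }

summary = summarize(records)
```

The same pass generalizes to the paper's full breakdown: run it over your own annotated logs and compare the per-level `failure_rate` against `visible_failure_share`, which is where the paradox shows up.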
Evidence
- "93% of high fluency users are augmentative vs. less than 1% of minimal fluency users (Figure 3). Failure rates: 64% high fluency vs. 24% minimal fluency — however, 59% of high fluency failures are visible, 85.6% of minimal fluency failures are invisible (Figure 6). Average task complexity: 3.08 for high fluency vs. 1.46 for minimal fluency (5-point scale); successful cases also show a difference: 3.13 vs. 1.79 (Figure 7). Regression model fluency coefficient: +0.111 (p<0.01) for success model, +0.691 (p<0.001) for failure visibility model — higher numbers indicate stronger contribution of fluency (Table 2, 3)."
How to Apply
- When designing AI chatbot UIs, prioritize interfaces that encourage critical review of results over 'friction-free' experiences. For example, incorporating checkpoints like 'Is this answer what you were looking for? Are any revisions needed?' can reduce passive acceptance.
- Explicitly communicate in user onboarding flows that 'AI can be wrong' and demonstrate examples of iterative refinement – receiving a result, then requesting modifications. This can help counteract the delegative pattern common among novice users.
- Implement a pipeline to monitor 'invisible failure' patterns (The Walkaway, The Silent Mismatch) in AI service logs to capture actual failures missed by surface-level completion rate metrics.
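The third point can be sketched as a log-scanning heuristic. The pattern names ('The Walkaway', 'The Silent Mismatch') come from the paper, but the trigger conditions below – a long unacknowledged final AI turn, a bare closing acknowledgement – are illustrative assumptions, not the paper's detection criteria:

```python
def flag_invisible_failure(turns):
    """turns: list of (speaker, text) tuples in order; returns a pattern label or None."""
    if not turns:
        return None
    last_speaker, last_text = turns[-1]
    # "The Walkaway": the conversation ends on a long AI answer the user never
    # acknowledged -- surface completion metrics would count this as success.
    if last_speaker == "assistant" and len(last_text) > 400:
        return "walkaway"
    # "The Silent Mismatch": the user closes with a bare acknowledgement,
    # with no evaluation or follow-up of the answer they received.
    acks = {"ok", "okay", "thanks", "thank you", "got it"}
    if last_speaker == "user" and last_text.strip().lower().rstrip("!.") in acks:
        return "silent_mismatch"
    return None
```

Flagged conversations can then be sampled for human review, which gives a failure estimate that surface completion rates would miss.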
Code Example
# Example LLM annotation prompt for judging user fluency level (based on the paper's annotation protocol)
system_prompt = """
You are an AI fluency annotator. Analyze the following conversation transcript and evaluate the user's AI fluency level.
Evaluate the following dimensions:
1. Interaction Style:
- augmentative: User iterates collaboratively, refines goals, critically assesses outputs
- delegative: User passively accepts AI plans and responses
- other: Neither pattern dominates
2. Fluency Behaviors (mark all that apply):
- iterative_refinement: User refines requests based on AI output
- critical_output_evaluation: User questions or challenges AI responses
- context_provision: User provides rich background context
- goal_clarification: User clarifies or sharpens their goals mid-conversation
- decomposition: User breaks complex tasks into subtasks
- fact_checking: User verifies AI claims
3. Anti-Fluency Behaviors (mark all that apply):
- passive_acceptance: User accepts AI output without scrutiny
- vague_delegation: User gives underspecified instructions
- over_trust: User shows excessive trust in AI responses
- prompt_flailing: User makes random changes without clear strategy
4. Overall Fluency Assessment: high | moderate | low | minimal
Respond in JSON format.
"""
# `conversation_transcript` is assumed to hold the raw conversation text to annotate
conversation_transcript = "..."  # placeholder: load a transcript from your logs
user_message = f"""
Transcript:
{conversation_transcript}
Provide your fluency annotation:
"""
# Example output structure:
# {
# "transcript_summary": "User asked for help debugging a React component",
# "interaction_style": "augmentative",
# "fluency_behaviors": [
# {"behavior": "iterative_refinement", "strength": 3, "evidence": "User provided specific error message and asked follow-up"},
# {"behavior": "critical_output_evaluation", "strength": 2, "evidence": "User questioned whether suggested fix would cause side effects"}
# ],
# "anti_fluency_behaviors": [],
# "fluency_assessment": "high",
# "assessment_rationale": "User demonstrated clear augmentative behavior..."
# }
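A minimal validation step for the annotator's output might look like the following sketch. The required field names mirror the example structure above and should be adjusted if your schema differs:

```python
import json

# Field names and allowed values follow the example output above (assumed schema).
REQUIRED = {"interaction_style", "fluency_behaviors", "anti_fluency_behaviors", "fluency_assessment"}
STYLES = {"augmentative", "delegative", "other"}
LEVELS = {"high", "moderate", "low", "minimal"}

def parse_annotation(raw: str) -> dict:
    """Parse one LLM annotation and sanity-check it; raises ValueError on bad output."""
    ann = json.loads(raw)
    missing = REQUIRED - ann.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if ann["interaction_style"] not in STYLES:
        raise ValueError(f"unknown style: {ann['interaction_style']}")
    if ann["fluency_assessment"] not in LEVELS:
        raise ValueError(f"unknown level: {ann['fluency_assessment']}")
    return ann
```

Rejecting malformed annotations early keeps downstream aggregation (fluency-level rates, failure visibility) from silently absorbing bad labels.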
Related Papers
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.
4TB of voice samples just stolen from 40k AI contractors at Mercor
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
I cancelled Claude: Token issues, declining quality, and poor support
Anthropic’s Claude Code Pro experienced a three-week decline in speed, token allowance, and support quality, sparking a community discussion among developers.
Different Language Models Learn Similar Number Representations
LLMs, regardless of architecture—from Transformers to LSTMs—consistently learn periodic patterns with periods T=2, 5, and 10 when representing numbers, mathematically explaining a 'convergent evolution' phenomenon beyond model architecture.
Diagnosing CFG Interpretation in LLMs
Original Abstract
How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin-fluency-outcomes