A paradox of AI fluency
TL;DR Highlight
Expert AI users experience more failures, but those failures are visible and recoverable; novices' failures are often invisible, ending conversations that look successful but miss the mark.
Who Should Read
Product developers building chatbot-based AI products, UX engineers designing their interactions, and anyone seeking to understand user behavior patterns and failure modes in AI services in order to improve their products.
Core Mechanics
- Analysis of 27K conversations from WildChat-4.8M categorizes AI fluency into four levels (minimal/low/moderate/high); the proportion of high-fluency users remains consistently low, and new-user growth occurs primarily at the low/minimal levels.
- 93% of high-fluency users employ an augmentative style – treating AI as a collaborative tool, refining goals, critically reviewing outputs, and iteratively revising. Conversely, 87% of novice users adopt a delegative style, simply accepting AI results.
- Paradoxically, 64% of conversations from skilled users exhibit failure signals, compared to 24% from novices. However, this doesn't mean novices are more successful.
- 59% of failures among skilled users are 'visible failures' – those they recognize and can address. In contrast, 85.6% of novice failures are 'invisible failures' – conversations that appear successful but ultimately fail to achieve the user's goals.
- Skilled users attempt more complex tasks (average task complexity of 3.1 on a 5-point scale) compared to novices (1.5), yet achieve higher success rates with these more challenging tasks.
- Regression modeling confirms fluency is a significant predictor of both success rate (p<0.01) and failure visibility (p<0.001). While increased conversation turns and complexity decrease success rates, fluency mitigates this effect.
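The aggregate statistics above can be sketched as a simple tallying pass over annotated conversations. The record fields below (`fluency`, `style`, `failed`, `failure_visible`) and the sample data are illustrative assumptions, not the paper's actual annotation schema:

```python
from collections import defaultdict

# Hypothetical annotated records: (fluency, style, failed, failure_visible)
records = [
    ("high", "augmentative", True, True),
    ("high", "augmentative", False, None),
    ("minimal", "delegative", True, False),
    ("minimal", "delegative", False, None),
    ("minimal", "delegative", True, False),
]

def summarize(records):
    """Aggregate style, failure, and failure-visibility rates per fluency level."""
    stats = defaultdict(lambda: {"n": 0, "augmentative": 0, "failed": 0, "visible": 0})
    for fluency, style, failed, visible in records:
        s = stats[fluency]
        s["n"] += 1
        s["augmentative"] += style == "augmentative"
        if failed:
            s["failed"] += 1
            s["visible"] += bool(visible)
    return {
        level: {
            "augmentative_rate": s["augmentative"] / s["n"],
            "failure_rate": s["failed"] / s["n"],
            # share of failures the user actually noticed; the rest are "invisible"
            "visible_failure_share": s["visible"] / s["failed"] if s["failed"] else None,
        }
        for level, s in stats.items()
    }

summary = summarize(records)
```

The same pass generalizes to the paper's full breakdown: run it over your own annotated logs and compare the per-level `failure_rate` against `visible_failure_share`, which is where the paradox shows up.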
Evidence
- "93% of high fluency users are augmentative vs. less than 1% of minimal fluency users (Figure 3). Failure rates: 64% high fluency vs. 24% minimal fluency — however, 59% of high fluency failures are visible, 85.6% of minimal fluency failures are invisible (Figure 6). Average task complexity: 3.08 for high fluency vs. 1.46 for minimal fluency (5-point scale); successful cases also show a difference: 3.13 vs. 1.79 (Figure 7). Regression model fluency coefficient: +0.111 (p<0.01) for success model, +0.691 (p<0.001) for failure visibility model — higher numbers indicate stronger contribution of fluency (Table 2, 3)."
How to Apply
- When designing AI chatbot UIs, prioritize interfaces that encourage critical review of results over 'friction-free' experiences. For example, incorporating checkpoints like 'Is this answer what you were looking for? Are any revisions needed?' can reduce passive acceptance.
- Explicitly communicate in user onboarding flows that 'AI can be wrong' and demonstrate examples of iterative refinement – receiving a result, then requesting modifications. This can help counteract the delegative pattern common among novice users.
- Implement a pipeline to monitor 'invisible failure' patterns (The Walkaway, The Silent Mismatch) in AI service logs to capture actual failures missed by surface-level completion rate metrics.
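The third point can be sketched as a log-scanning heuristic. The pattern names ('The Walkaway', 'The Silent Mismatch') come from the paper, but the trigger conditions below – a long unacknowledged final AI turn, a bare closing acknowledgement – are illustrative assumptions, not the paper's detection criteria:

```python
def flag_invisible_failure(turns):
    """turns: list of (speaker, text) tuples in order; returns a pattern label or None."""
    if not turns:
        return None
    last_speaker, last_text = turns[-1]
    # "The Walkaway": the conversation ends on a long AI answer the user never
    # acknowledged -- surface completion metrics would count this as success.
    if last_speaker == "assistant" and len(last_text) > 400:
        return "walkaway"
    # "The Silent Mismatch": the user closes with a bare acknowledgement,
    # with no evaluation or follow-up of the answer they received.
    acks = {"ok", "okay", "thanks", "thank you", "got it"}
    if last_speaker == "user" and last_text.strip().lower().rstrip("!.") in acks:
        return "silent_mismatch"
    return None
```

Flagged conversations can then be sampled for human review, which gives a failure estimate that surface completion rates would miss.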
Code Example
# Example LLM annotation prompt for judging user fluency level (based on the paper's annotation protocol)
system_prompt = """
You are an AI fluency annotator. Analyze the following conversation transcript and evaluate the user's AI fluency level.
Evaluate the following dimensions:
1. Interaction Style:
- augmentative: User iterates collaboratively, refines goals, critically assesses outputs
- delegative: User passively accepts AI plans and responses
- other: Neither pattern dominates
2. Fluency Behaviors (mark all that apply):
- iterative_refinement: User refines requests based on AI output
- critical_output_evaluation: User questions or challenges AI responses
- context_provision: User provides rich background context
- goal_clarification: User clarifies or sharpens their goals mid-conversation
- decomposition: User breaks complex tasks into subtasks
- fact_checking: User verifies AI claims
3. Anti-Fluency Behaviors (mark all that apply):
- passive_acceptance: User accepts AI output without scrutiny
- vague_delegation: User gives underspecified instructions
- over_trust: User shows excessive trust in AI responses
- prompt_flailing: User makes random changes without clear strategy
4. Overall Fluency Assessment: high | moderate | low | minimal
Respond in JSON format.
"""
# `conversation_transcript` is assumed to hold the raw conversation text to annotate
conversation_transcript = "..."  # placeholder: load a transcript from your logs
user_message = f"""
Transcript:
{conversation_transcript}
Provide your fluency annotation:
"""
# Example output structure:
# {
# "transcript_summary": "User asked for help debugging a React component",
# "interaction_style": "augmentative",
# "fluency_behaviors": [
# {"behavior": "iterative_refinement", "strength": 3, "evidence": "User provided specific error message and asked follow-up"},
# {"behavior": "critical_output_evaluation", "strength": 2, "evidence": "User questioned whether suggested fix would cause side effects"}
# ],
# "anti_fluency_behaviors": [],
# "fluency_assessment": "high",
# "assessment_rationale": "User demonstrated clear augmentative behavior..."
# }
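A minimal validation step for the annotator's output might look like the following sketch. The required field names mirror the example structure above and should be adjusted if your schema differs:

```python
import json

# Field names and allowed values follow the example output above (assumed schema).
REQUIRED = {"interaction_style", "fluency_behaviors", "anti_fluency_behaviors", "fluency_assessment"}
STYLES = {"augmentative", "delegative", "other"}
LEVELS = {"high", "moderate", "low", "minimal"}

def parse_annotation(raw: str) -> dict:
    """Parse one LLM annotation and sanity-check it; raises ValueError on bad output."""
    ann = json.loads(raw)
    missing = REQUIRED - ann.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if ann["interaction_style"] not in STYLES:
        raise ValueError(f"unknown style: {ann['interaction_style']}")
    if ann["fluency_assessment"] not in LEVELS:
        raise ValueError(f"unknown level: {ann['fluency_assessment']}")
    return ann
```

Rejecting malformed annotations early keeps downstream aggregation (fluency-level rates, failure visibility) from silently absorbing bad labels.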
Related Papers
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.
4TB of voice samples just stolen from 40k AI contractors at Mercor
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
I cancelled Claude: Token issues, declining quality, and poor support
Anthropic’s Claude Code Pro experienced a three-week decline in speed, token allowance, and support quality, sparking a community discussion among developers.
Different Language Models Learn Similar Number Representations
LLMs, regardless of architecture—from Transformers to LSTMs—consistently learn periodic patterns with periods T=2, 5, and 10 when representing numbers, mathematically explaining a 'convergent evolution' phenomenon beyond model architecture.
Diagnosing CFG Interpretation in LLMs
Original Abstract
How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin-fluency-outcomes