More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice
TL;DR Highlight
A 348-person experiment shows that a 3-AI panel improves decision accuracy over a single AI, that a 5-AI panel adds no further benefit, and that unanimous AI agreement triggers dangerous over-reliance.
Who Should Read
Product developers and UX engineers building advisory services that combine multiple chatbots or AI assistants. Especially teams designing AI decision-support systems in healthcare, legal, or finance.
Core Mechanics
- An AI panel of 3 improves accuracy over a single AI (Income task: 0.706→0.737), but expanding the panel to 5 yields no further gain; "more is better" does not hold here
- When all AIs unanimously agree (CON), users tend to follow blindly, an overreliance pattern in which the Switch Fraction spikes to 0.88
- Even a single dissenting AI in the panel meaningfully reduces conformity pressure and increases the rate of users maintaining their own judgment (RSR)
- When a 5-panel splits 3:2 (DIV_3), it creates confusion with no accuracy gain; a near-even AI split is counterproductive
- Humanizing AIs (photos, names, conversational tone) has no significant effect on average accuracy or reliance, though it does increase perceived usefulness on the Dating task
- The study used GPT-4o to generate SHAP-based natural-language explanations attached to the AI advice, reducing hallucination while improving interpretability
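The reliance metrics used throughout (Agreement Fraction, Switch Fraction, RAIR, RSR) can be computed from per-trial records. The sketch below assumes the standard appropriate-reliance definitions; the paper's exact operationalization may differ in detail, and the `Trial` record and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    initial: str  # participant's answer before seeing advice
    final: str    # participant's answer after seeing advice
    ai: str       # the panel's (majority) recommendation
    truth: str    # ground-truth label

def _frac(hits, pool):
    # Conditional fraction; NaN when the conditioning set is empty.
    return len(hits) / len(pool) if pool else float("nan")

def reliance_metrics(trials):
    """Agreement/Switch fractions plus RAIR and RSR (assumed definitions)."""
    agree = [t for t in trials if t.final == t.ai]
    disagreed = [t for t in trials if t.initial != t.ai]
    switched = [t for t in disagreed if t.final == t.ai]
    # RAIR: when initially wrong and the AI was right, did the user switch?
    ai_right = [t for t in disagreed if t.ai == t.truth]
    rair_hits = [t for t in ai_right if t.final == t.ai]
    # RSR: when initially right and the AI was wrong, did the user hold firm?
    self_right = [t for t in disagreed if t.initial == t.truth]
    rsr_hits = [t for t in self_right if t.final == t.initial]
    return {
        "agreement_fraction": _frac(agree, trials),
        "switch_fraction": _frac(switched, disagreed),
        "RAIR": _frac(rair_hits, ai_right),
        "RSR": _frac(rsr_hits, self_right),
    }
```

Under these definitions, a Switch Fraction of 0.88 with an RSR of 0.21 is exactly the overreliance signature the CON condition produced.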
Evidence
- Income task: AI_3 significantly outperforms AI_1 (0.706→0.737, p=.012); AI_5 shows no significant difference
- Dating task: AI_3 outperforms AI_1 (median 0.64→0.68, p=.002); AI_5 is borderline (p=.064)
- 5-panel CON condition: Agreement Fraction 0.99, Switch Fraction 0.88 (near-unconditional AI following); DIV_3 condition: Switch Fraction drops to 0.30 with no accuracy improvement
- 3-panel CON vs DIV: RAIR (rate of following correct AI) is higher in CON (Income: 0.90 vs 0.46), RSR (rate of holding correct own answer against wrong AI) is higher in DIV (0.60 vs 0.21, p<.001)
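For teams wanting to reproduce comparisons like AI_1 vs AI_3 on their own logs, a simple permutation test is one option. This is a stand-in for whatever test the paper actually used (not stated here), and the function name and defaults are illustrative.

```python
import random

def perm_test(a, b, n=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n  # p-value estimate
```

With per-participant accuracy scores in `a` (single AI) and `b` (3-panel), a small returned p-value indicates the accuracy gap is unlikely under chance relabeling.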
How to Apply
- When showing multiple AI outputs to users simultaneously, default to 3 and highlight minority opinions separately to encourage critical review
- Add a reflection trigger in the UI when the AI panel is unanimous (e.g. "AIs agree, but please review this yourself") to prevent blind over-reliance
- Limit humanization elements (avatars, names, conversational tone) to specific tasks where emotional judgment matters. Average accuracy and reliance are largely unaffected, so avoid unnecessary implementation cost
Code Example
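As a concrete illustration of the "How to Apply" guidance, here is a minimal sketch of panel-presentation logic that distinguishes unanimity, a single dissent, and a near-even split. The function name, message strings, and return shape are assumptions for illustration, not from the paper.

```python
from collections import Counter

def present_panel(advice):
    """Frame a multi-AI panel's advice according to its consensus level.

    advice: list of answer labels, one per AI (e.g. a 3- or 5-panel).
    Returns a recommendation plus a UI note (illustrative strings).
    """
    counts = Counter(advice)
    top, top_n = counts.most_common(1)[0]
    n = len(advice)
    if top_n == n:
        # Unanimity drives overreliance: attach a reflection prompt.
        return {"recommendation": top,
                "note": "All AIs agree, but please review this yourself."}
    if n - top_n == 1:
        # A single dissent reduces conformity pressure: surface it.
        minority = sorted(a for a in advice if a != top)
        return {"recommendation": top,
                "note": f"Minority opinion: {minority[0]}. Consider why it differs."}
    # Near-even splits (e.g. 3:2) create confusion without accuracy gains,
    # so prompt deliberation instead of pushing a vote count.
    return {"recommendation": None,
            "note": "The AIs are split; weigh the arguments rather than the vote."}
```

The design choice here follows the findings directly: never let a unanimous panel pass without a reflection trigger, highlight a lone dissenter, and avoid presenting a 3:2 split as if the majority were meaningful.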
Related Resources
Original Abstract
Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.