Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design
TL;DR Highlight
Experiments show that improving distractor quality in multiple-choice questions significantly boosts RLVR training effectiveness, and the paper provides an automated pipeline (Iterative Distractor Curation) to do it.
Who Should Read
Researchers using RL with verifiable rewards (RLVR) for LLM training, and teams building or curating multiple-choice QA datasets for training.
Core Mechanics
- Demonstrated that the quality of wrong answer options (distractors) in multiple-choice problems directly impacts RLVR training effectiveness
- High-quality distractors (plausible, challenging wrong answers) create more informative training signal than easy-to-eliminate distractors
- Proposed an automated pipeline, Iterative Distractor Curation (IDC), that uses LLMs to improve distractor quality
- IDC constructs distractors that are wrong but plausible, blocking elimination shortcuts and forcing the model to reason more carefully
- Also found that mismatched option counts between training and testing degrade performance, while strong distractors enable effective RLVR training even with 2-way questions
- Training with IDC-curated data significantly outperforms training on the original datasets
- Effect is most pronounced for reasoning-heavy tasks where superficial pattern matching is insufficient
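The "verifiable reward" in MCQ-based RLVR reduces to an exact-match check on the chosen option. The sketch below illustrates that idea; the helper names (`extract_choice`, `mcq_reward`) and the answer-extraction regex are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of a verifiable MCQ reward for RLVR.
# extract_choice and mcq_reward are hypothetical helpers, not the paper's code.
import re


def extract_choice(completion: str):
    """Pull the final answer letter (A-D) from a model completion, if any."""
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None


def mcq_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted choice matches the gold letter."""
    return 1.0 if extract_choice(completion) == gold.upper() else 0.0


print(mcq_reward("After reasoning, the answer is (B).", "B"))  # 1.0
print(mcq_reward("Probably one of the first two options.", "B"))  # 0.0 (no parsable choice)
```

With weak distractors, a model can earn this reward through elimination or guessing; stronger distractors make the same binary signal more informative.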
Evidence
- RLVR training with improved distractors shows significant performance gains on reasoning benchmarks
- IDC-curated distractors were judged by human evaluators as significantly more challenging than the original distractors
- Ablation confirms distractor quality is the causal factor, not data quantity or other confounds
- Benefits hold across multiple model sizes and RLVR training setups
How to Apply
- Before running RLVR training on a multiple-choice dataset, run the distractors through an IDC-style pipeline to upgrade their quality
- Use an LLM to generate alternative distractors that are wrong (verifiably incorrect), plausible (topic-relevant), and challenging (requiring reasoning to eliminate)
- Combine IDC with difficulty filtering to obtain a training set in which every problem requires genuine reasoning
Code Example
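As a hedged illustration of the curation loop described in How to Apply, here is a minimal Python sketch. The `generate` and `is_challenging` callables stand in for LLM calls (a distractor proposer and a difficulty judge); they, along with `curate_distractors`, are hypothetical names under stated assumptions, not the paper's actual API.

```python
# Hedged sketch of an iterative distractor-curation loop in the spirit of IDC.
# `generate` and `is_challenging` are stand-ins for LLM calls; this is an
# illustrative assumption, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MCQ:
    question: str
    answer: str
    distractors: List[str] = field(default_factory=list)


def curate_distractors(
    mcq: MCQ,
    generate: Callable[[str], List[str]],          # question -> candidate wrong answers
    is_challenging: Callable[[str, str], bool],    # (question, candidate) -> hard to eliminate?
    n_needed: int = 3,
    max_rounds: int = 3,
) -> MCQ:
    """Iteratively collect distractors that are wrong, plausible, and challenging."""
    kept: List[str] = []
    for _ in range(max_rounds):
        for cand in generate(mcq.question):
            # Verifiably wrong: must not coincide with the gold answer.
            if cand.strip().lower() == mcq.answer.strip().lower():
                continue
            # Challenging: keep only candidates the judge cannot trivially eliminate.
            if cand not in kept and is_challenging(mcq.question, cand):
                kept.append(cand)
            if len(kept) >= n_needed:
                return MCQ(mcq.question, mcq.answer, kept)
    return MCQ(mcq.question, mcq.answer, kept)


# Usage with stubbed "LLM" callables:
stub_gen = lambda q: ["3", "4", "5", "6"]
stub_judge = lambda q, d: d != "3"  # pretend "3" is trivially eliminable
curated = curate_distractors(MCQ("What is 2 + 2?", "4"), stub_gen, stub_judge)
print(curated.distractors)  # ['5', '6'] — "4" is the answer, "3" fails the judge
```

In practice the two callables would wrap a generator model and a judge model, and the loop would rerun generation until enough hard distractors survive verification.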
Terminology
- RLVR: Reinforcement Learning with Verifiable Rewards — RL training in which rewards come from automatically checkable answers
- MCQ: Multiple-Choice Question
- Distractor: an incorrect answer option in an MCQ
- Reward hacking: models shortcutting reasoning, e.g., via random guessing or simple elimination
- IDC: Iterative Distractor Curation — the proposed framework that actively constructs high-quality distractors
Original Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.