Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design
TL;DR Highlight
Experiments show that improving distractor quality in multiple-choice questions significantly boosts RLVR training effectiveness, and the paper provides an automated pipeline (Iterative Distractor Curation) to do it.
Who Should Read
Researchers using RL with verifiable rewards (RLVR) for LLM training, and teams building or curating multiple-choice QA datasets for training.
Core Mechanics
- Demonstrated that the quality of wrong answer options (distractors) in multiple-choice problems directly impacts RLVR training effectiveness
- High-quality distractors (plausible, challenging wrong answers) create more informative training signal than easy-to-eliminate distractors
- Proposed an automated pipeline, Iterative Distractor Curation (IDC), that uses LLMs to improve distractor quality
- IDC constructs distractors that are wrong but plausible, blocking elimination shortcuts and forcing the model to reason more carefully
- Also found that mismatched option counts between training and testing degrade performance, while strong distractors enable effective RLVR training even with 2-way questions
- Training with IDC-curated data significantly outperforms training on the original datasets
- Effect is most pronounced for reasoning-heavy tasks where superficial pattern matching is insufficient
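The "verifiable reward" in MCQ-based RLVR reduces to an exact-match check on the chosen option. The sketch below illustrates that idea; the helper names (`extract_choice`, `mcq_reward`) and the answer-extraction regex are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of a verifiable MCQ reward for RLVR.
# extract_choice and mcq_reward are hypothetical helpers, not the paper's code.
import re


def extract_choice(completion: str):
    """Pull the final answer letter (A-D) from a model completion, if any."""
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None


def mcq_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the extracted choice matches the gold letter."""
    return 1.0 if extract_choice(completion) == gold.upper() else 0.0


print(mcq_reward("After reasoning, the answer is (B).", "B"))  # 1.0
print(mcq_reward("Probably one of the first two options.", "B"))  # 0.0 (no parsable choice)
```

With weak distractors, a model can earn this reward through elimination or guessing; stronger distractors make the same binary signal more informative.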
Evidence
- RLVR training with improved distractors shows significant performance gains on reasoning benchmarks
- IDC-curated distractors were judged by human evaluators as significantly more challenging than the original distractors
- Ablation confirms distractor quality is the causal factor, not data quantity or other confounds
- Benefits hold across multiple model sizes and RLVR training setups
How to Apply
- Before running RLVR training on a multiple-choice dataset, run the distractors through an IDC-style pipeline to upgrade their quality
- Use an LLM to generate alternative distractors that are wrong (verifiably incorrect), plausible (topic-relevant), and challenging (requiring reasoning to eliminate)
- Combine IDC with difficulty filtering to obtain a training set in which every problem requires genuine reasoning
Code Example
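As a hedged illustration of the curation loop described in How to Apply, here is a minimal Python sketch. The `generate` and `is_challenging` callables stand in for LLM calls (a distractor proposer and a difficulty judge); they, along with `curate_distractors`, are hypothetical names under stated assumptions, not the paper's actual API.

```python
# Hedged sketch of an iterative distractor-curation loop in the spirit of IDC.
# `generate` and `is_challenging` are stand-ins for LLM calls; this is an
# illustrative assumption, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MCQ:
    question: str
    answer: str
    distractors: List[str] = field(default_factory=list)


def curate_distractors(
    mcq: MCQ,
    generate: Callable[[str], List[str]],          # question -> candidate wrong answers
    is_challenging: Callable[[str, str], bool],    # (question, candidate) -> hard to eliminate?
    n_needed: int = 3,
    max_rounds: int = 3,
) -> MCQ:
    """Iteratively collect distractors that are wrong, plausible, and challenging."""
    kept: List[str] = []
    for _ in range(max_rounds):
        for cand in generate(mcq.question):
            # Verifiably wrong: must not coincide with the gold answer.
            if cand.strip().lower() == mcq.answer.strip().lower():
                continue
            # Challenging: keep only candidates the judge cannot trivially eliminate.
            if cand not in kept and is_challenging(mcq.question, cand):
                kept.append(cand)
            if len(kept) >= n_needed:
                return MCQ(mcq.question, mcq.answer, kept)
    return MCQ(mcq.question, mcq.answer, kept)


# Usage with stubbed "LLM" callables:
stub_gen = lambda q: ["3", "4", "5", "6"]
stub_judge = lambda q, d: d != "3"  # pretend "3" is trivially eliminable
curated = curate_distractors(MCQ("What is 2 + 2?", "4"), stub_gen, stub_judge)
print(curated.distractors)  # ['5', '6'] — "4" is the answer, "3" fails the judge
```

In practice the two callables would wrap a generator model and a judge model, and the loop would rerun generation until enough hard distractors survive verification.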
Terminology
- RLVR: Reinforcement Learning with Verifiable Rewards — RL training in which rewards come from automatically checkable answers
- MCQ: Multiple-Choice Question
- Distractor: an incorrect answer option in an MCQ
- Reward hacking: models shortcutting reasoning, e.g., via random guessing or simple elimination
- IDC: Iterative Distractor Curation — the proposed framework that actively constructs high-quality distractors
Original Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.