When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
TL;DR Highlight
Disagreement-guided routing boosts LLM accuracy on math and code by 3-7% while reducing sampling cost, by adaptively picking a test-time strategy per problem.
Who Should Read
ML engineers deploying reasoning models (Qwen3, DeepSeek-R1, etc.) for math/coding tasks who want to optimize inference cost and accuracy simultaneously, and AI researchers exploring test-time compute strategies.
Core Mechanics
- The degree of output disagreement when a model is sampled multiple times on the same problem is strongly correlated with problem difficulty and correctness: easy problems yield consistent answers, while difficult problems show high variance.
- Majority voting excels at easy problems but degrades on difficult ones, while rewriting (rephrasing the problem) exhibits the opposite pattern—the two must be used selectively.
- A three-stage routing process: terminate after two samples if the answers agree (NDS); add more samples and majority-vote if there is only one disagreement (MDS); rewrite the problem and re-infer if there are two or more disagreements (SDS).
- The framework operates without fine-tuning or external reward models—the same base model receives rewriting prompts to re-represent problems and solve them independently.
- The paper clearly analyzes where rewriting is ineffective: answer-format errors, context loss during long reasoning, and arithmetic calculation errors are not corrected by rewriting; it is, however, highly effective for problems with hidden conditions.
- The same strategy applies to code generation (HumanEval, MBPP) as well as mathematical reasoning; disagreement in code is judged by test-case execution results instead of textual identity (see the sketch after this list).
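To make the routing signal and the code-generation variant concrete, here is a minimal sketch (an illustration, not the paper's code) of counting disagreement across k samples, plus an execution-based equality check for programs; run_fn and the test-input format are hypothetical.
# Sketch: quantifying disagreement across k samples (illustrative, not the paper's code)
from collections import Counter

def disagreement_count(answers: list) -> int:
    """Number of samples deviating from the modal answer; 0 = full agreement."""
    top_count = Counter(answers).most_common(1)[0][1]
    return len(answers) - top_count

def code_answers_equal(prog1: str, prog2: str, test_inputs: list, run_fn) -> bool:
    """Two programs 'agree' when they produce identical outputs on shared test inputs.
    run_fn(program, test_input) -> output is a hypothetical sandboxed executor."""
    return all(run_fn(prog1, t) == run_fn(prog2, t) for t in test_inputs)
For math, routing compares normalized answer strings; for code, the same three-stage logic applies with code_answers_equal substituted for string equality.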
Evidence
- "On Qwen3-8B, average accuracy across 7 math benchmarks: baseline 61.3% → our method 75.8% (+6.9%p vs. majority voting at 68.9%)."
How to Apply
- "When calling the inference API, first sample twice with a temperature of 0.6 and return immediately if the answers are the same; otherwise, sample twice more and majority vote among the four; if still inconsistent, rewrite the problem with a prompt like 'remove unnecessary descriptions while preserving key numbers/symbols' and re-infer—limit to a maximum of 6 calls."
Code Example
# Core logic: Disagreement-Guided Strategy Routing
import re

REASON_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."
REWRITE_PROMPT = """Please remove unnecessary descriptions from the following question, \
simplify its length while keeping the original meaning unchanged, \
and retain important numbers and symbols. \
Only provide the revised question without answers or calculations."""

def extract_answer(output: str) -> str:
    """Extract the answer within \\boxed{} (simple regex; nested braces are not handled)."""
    match = re.search(r'\\boxed\{([^}]+)\}', output)
    return match.group(1).strip() if match else output.strip()

def answers_equal(a1: str, a2: str) -> bool:
    """Compare answers after normalization (numerical, structural, and symbolic equivalence)."""
    return a1.strip() == a2.strip()  # more sophisticated matching is needed in practice

def disagreement_guided_inference(problem: str, model_fn, max_samples: int = 6):
    """
    model_fn: a function that calls the model with a (prompt) -> str interface
    max_samples: total call budget; the three stages below use at most 6 calls
    """
    # Stage 1: Disagreement Filter (2 samples)
    out1 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    out2 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    a1, a2 = extract_answer(out1), extract_answer(out2)
    if answers_equal(a1, a2):
        # NDS: answers match -> return immediately
        return a1, "NDS", 2

    # Stage 2: Vote Resolve (2 more samples)
    out3 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    out4 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    a3, a4 = extract_answer(out3), extract_answer(out4)
    if answers_equal(a3, a4):
        # MDS: second pair matches -> majority vote over all 4 samples
        answers = [a1, a2, a3, a4]
        final = max(set(answers), key=answers.count)
        return final, "MDS", 4

    # Stage 3: Rewrite & Rethink (1 rewrite call + 1 reasoning call)
    rewritten = model_fn(f"{REWRITE_PROMPT}\n\n{problem}")
    out_rewrite = model_fn(f"{REASON_PROMPT}\n\n{rewritten}")
    final_answer = extract_answer(out_rewrite)
    return final_answer, "SDS", 6

# Example usage
# answer, stage, n_samples = disagreement_guided_inference(math_problem, my_llm_fn)
# print(f"Answer: {answer}, Stage: {stage}, Samples used: {n_samples}")Terminology
Related Papers
Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
Five failure modes and eight practical solutions that emerged from five days of running on-device SLMs (Gemma 4 E2B, Qwen3 0.6B) in a Wordle app.
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
90%+ fewer tokens per session by reading a pre-compiled wiki instead of exploring files cold. Built from Karpathy's workflow.
A workflow-sharing post on how pre-organizing a codebase into a wiki can cut token usage per Claude session by more than 90%, compared with exploring the codebase from scratch each time.
I mass deleted 3 months of AI generated code last week. Here is what I learned.
A retrospective by a developer who deleted three months' worth of code after over-relying on AI code generation; access to the original post is blocked, so the details could not be verified.
This new technique saves 60% of my token expenses
You can reduce LLM response tokens by 60% by using a telegraphic style that keeps only nouns and verbs, dropping articles, conjunctions, and auxiliary verbs.
Original Abstract
Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.