When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
TL;DR Highlight
Disagreement-guided routing boosts LLM accuracy on math and code by 3-7% while reducing sampling cost, by adaptively picking a test-time strategy per problem.
Who Should Read
ML engineers deploying reasoning models (Qwen3, DeepSeek-R1, etc.) for math/coding tasks who want to optimize inference cost and accuracy simultaneously, and AI researchers exploring test-time compute strategies.
Core Mechanics
- The degree of output disagreement when a model is sampled multiple times on the same problem is strongly correlated with problem difficulty and correctness: easy problems yield consistent answers, while difficult problems show high variance.
- Majority voting excels at easy problems but degrades on difficult ones, while rewriting (rephrasing the problem) exhibits the opposite pattern—the two must be used selectively.
- A three-stage routing process: terminate after two samples if the answers agree (NDS); add more samples and majority-vote if there is only one disagreement (MDS); rewrite the problem and re-infer if there are two or more disagreements (SDS).
- The framework operates without fine-tuning or external reward models—the same base model receives rewriting prompts to re-represent problems and solve them independently.
- The paper clearly analyzes where rewriting is ineffective: answer-format errors, context loss during long reasoning, and arithmetic calculation errors are not corrected by rewriting; it is, however, highly effective for problems with hidden conditions.
- The same strategy applies to code generation (HumanEval, MBPP) as well as mathematical reasoning; disagreement in code is judged by test-case execution results instead of textual identity (see the sketch after this list).
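To make the routing signal and the code-generation variant concrete, here is a minimal sketch (an illustration, not the paper's code) of counting disagreement across k samples, plus an execution-based equality check for programs; run_fn and the test-input format are hypothetical.
# Sketch: quantifying disagreement across k samples (illustrative, not the paper's code)
from collections import Counter

def disagreement_count(answers: list) -> int:
    """Number of samples deviating from the modal answer; 0 = full agreement."""
    top_count = Counter(answers).most_common(1)[0][1]
    return len(answers) - top_count

def code_answers_equal(prog1: str, prog2: str, test_inputs: list, run_fn) -> bool:
    """Two programs 'agree' when they produce identical outputs on shared test inputs.
    run_fn(program, test_input) -> output is a hypothetical sandboxed executor."""
    return all(run_fn(prog1, t) == run_fn(prog2, t) for t in test_inputs)
For math, routing compares normalized answer strings; for code, the same three-stage logic applies with code_answers_equal substituted for string equality.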
Evidence
- "On Qwen3-8B, average accuracy across 7 math benchmarks: baseline 61.3% → our method 75.8% (+6.9%p vs. majority voting at 68.9%)."
How to Apply
- "When calling the inference API, first sample twice with a temperature of 0.6 and return immediately if the answers are the same; otherwise, sample twice more and majority vote among the four; if still inconsistent, rewrite the problem with a prompt like 'remove unnecessary descriptions while preserving key numbers/symbols' and re-infer—limit to a maximum of 6 calls."
Code Example
# Core logic: Disagreement-Guided Strategy Routing
import re

REASON_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."
REWRITE_PROMPT = """Please remove unnecessary descriptions from the following question, \
simplify its length while keeping the original meaning unchanged, \
and retain important numbers and symbols. \
Only provide the revised question without answers or calculations."""

def extract_answer(output: str) -> str:
    """Extract the answer within \\boxed{} (simple regex; nested braces are not handled)."""
    match = re.search(r'\\boxed\{([^}]+)\}', output)
    return match.group(1).strip() if match else output.strip()

def answers_equal(a1: str, a2: str) -> bool:
    """Compare answers after normalization (numerical, structural, and symbolic equivalence)."""
    return a1.strip() == a2.strip()  # more sophisticated matching is needed in practice

def disagreement_guided_inference(problem: str, model_fn, max_samples: int = 6):
    """
    model_fn: a function that calls the model with a (prompt) -> str interface
    max_samples: total call budget; the three stages below use at most 6 calls
    """
    # Stage 1: Disagreement Filter (2 samples)
    out1 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    out2 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    a1, a2 = extract_answer(out1), extract_answer(out2)
    if answers_equal(a1, a2):
        # NDS: answers match -> return immediately
        return a1, "NDS", 2

    # Stage 2: Vote Resolve (2 more samples)
    out3 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    out4 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    a3, a4 = extract_answer(out3), extract_answer(out4)
    if answers_equal(a3, a4):
        # MDS: second pair matches -> majority vote over all 4 samples
        answers = [a1, a2, a3, a4]
        final = max(set(answers), key=answers.count)
        return final, "MDS", 4

    # Stage 3: Rewrite & Rethink (1 rewrite call + 1 reasoning call)
    rewritten = model_fn(f"{REWRITE_PROMPT}\n\n{problem}")
    out_rewrite = model_fn(f"{REASON_PROMPT}\n\n{rewritten}")
    final_answer = extract_answer(out_rewrite)
    return final_answer, "SDS", 6

# Example usage
# answer, stage, n_samples = disagreement_guided_inference(math_problem, my_llm_fn)
# print(f"Answer: {answer}, Stage: {stage}, Samples used: {n_samples}")Terminology
Related Papers
Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
Five failure modes and eight practical solutions that emerged from five days of running on-device SLMs (Gemma 4 E2B, Qwen3 0.6B) in a Wordle app.
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
90%+ fewer tokens per session by reading a pre-compiled wiki instead of exploring files cold. Built from Karpathy's workflow.
A workflow-sharing post on how pre-organizing a codebase into a wiki can cut token usage per Claude session by more than 90%, compared with exploring the codebase from scratch each time.
I mass deleted 3 months of AI generated code last week. Here is what I learned.
A retrospective by a developer who deleted three months' worth of code after over-relying on AI code generation; access to the original post is blocked, so the details could not be verified.
This new technique saves 60% of my token expenses
You can reduce LLM response tokens by 60% by using a telegraphic style that keeps only nouns and verbs, dropping articles, conjunctions, and auxiliary verbs.
Original Abstract
Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.