When to Vote and When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

TL;DR Highlight

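A training-free framework that checks how much a model's sampled outputs agree, then automatically chooses majority voting for easy problems and problem rewriting for hard ones, raising accuracy by 3-7% while reducing sampling cost.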

Who Should Read

ML engineers who deploy reasoning models (Qwen3, DeepSeek-R1, and similar) on math or coding tasks and want to improve inference cost and accuracy together. AI researchers weighing test-time compute strategies.

Core Mechanics

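When a model samples the same problem several times, how much the answers differ (output disagreement) correlates strongly with problem difficulty and with correctness: easy problems yield matching answers, hard problems yield high variance.
Majority voting is strong on easy problems but loses accuracy on hard ones, while rewriting (restating the problem in different words) shows the opposite pattern, so the two should be chosen per instance.
Three-stage routing: sample twice and stop immediately if the answers match (NDS); if only the first pair disagrees, sample further and take a majority vote (MDS); if the second pair disagrees as well, rewrite the problem and reason again (SDS).
It runs without fine-tuning and without an external reward model: the same base model receives a rewriting prompt, restates the problem, and solves it again itself.
Cases where rewriting does not help are analyzed explicitly: answer-format errors, context loss during long reasoning, and arithmetic mistakes are not fixed by rewriting. In contrast, rewriting is very effective on problems with hidden conditions.
The same strategy applies beyond math reasoning to code generation (HumanEval, MBPP); for code, disagreement is judged by test-case execution results rather than by textual identity.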
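The disagreement signal and the three labels described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names `disagreement_stats` and `classify` are hypothetical, and answers are compared by plain string equality for simplicity.

```python
from collections import Counter

def disagreement_stats(answers):
    """Return (number of distinct answers, share of the most common answer)."""
    counts = Counter(answers)
    top_share = counts.most_common(1)[0][1] / len(answers)
    return len(counts), top_share

def classify(first_pair, second_pair):
    """Map pairwise agreement onto NDS / MDS / SDS (hypothetical helper).

    second_pair only matters when first_pair disagrees.
    """
    if first_pair[0] == first_pair[1]:
        return "NDS"  # no disagreement: accept the answer immediately
    if second_pair[0] == second_pair[1]:
        return "MDS"  # minor disagreement: resolve by majority vote
    return "SDS"      # severe disagreement: rewrite the problem

distinct, top_share = disagreement_stats(["4", "4", "5", "4"])
label = classify(("4", "5"), ("4", "4"))
```

A low `top_share` or a high distinct count would indicate that the model finds the problem hard, which is the situation in which rewriting is expected to help more than additional votes.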

Evidence

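Average accuracy on seven math benchmarks with Qwen3-8B: baseline 61.3%, proposed method 75.8% (+6.9 percentage points over majority voting at 68.9%).
Sample counts: on Math500 the method uses 1,572 samples against a maximum of 3,000 (reported as 40.5%), and on Olympiad 2,440 against 4,050 (60.2%), which lowers cost.
Average tokens and time per problem on Math500 with Qwen3-8B: majority voting 26.4K tokens and 15.5 s versus the proposed method at 11.6K tokens and 5.9 s (56% fewer tokens, 62% less time).
Code generation also improves: Qwen2.5-Coder-7B-Instruct average accuracy rises from 81.0% (majority voting) to 83.5% (proposed method), and DeepSeek-R1-Distill-Llama-8B from 37.8% to 42.7%.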
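The cost figures follow from how many problems end at each stage. The sketch below assumes the per-stage call counts used in the routing description (2, 4, and 6 model calls for NDS, MDS, and SDS); the stage fractions in the example are made up for illustration and are not reported in the source.

```python
def expected_calls(p_nds, p_mds, p_sds):
    """Expected model calls per problem, assuming 2 / 4 / 6 calls per stage."""
    if abs(p_nds + p_mds + p_sds - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return 2 * p_nds + 4 * p_mds + 6 * p_sds

def savings(baseline, ours):
    """Relative reduction of `ours` compared with `baseline`."""
    return 1 - ours / baseline

# Hypothetical stage mix: 60% NDS, 25% MDS, 15% SDS.
calls = expected_calls(0.60, 0.25, 0.15)

# Reported Math500 figures for Qwen3-8B (tokens in thousands, time in seconds).
token_savings = savings(26.4, 11.6)
time_savings = savings(15.5, 5.9)
```

The more problems resolve at the first stage, the closer the expected cost gets to two calls per problem, which is why the savings are larger on benchmarks where the model mostly agrees with itself.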

How to Apply

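When calling an inference API, first sample twice at temperature 0.6 and return immediately if the answers match. If they differ, draw two more samples and take a majority vote over the four. If disagreement persists, rewrite the problem with a prompt that removes unnecessary description while preserving key numbers and symbols, then reason again. Cap the total at six calls.
In a code-generation pipeline, decide whether two generated programs agree by comparing their results on the same test cases rather than by string comparison; this gives disagreement detection based on functional equivalence.
On easy benchmarks (such as GSM8K) where model accuracy is already above 95%, the SDS share is only 1-2%, so the method has little effect. The gains are largest on benchmarks the model struggles with (Olympiad, AIME).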
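The execution-based comparison for code can be sketched as follows. This is an illustrative assumption rather than the paper's harness: `run_candidate` and `codes_agree` are hypothetical names, the candidates are assumed to define a function with a known name, and `exec` is used directly, which is unsafe for untrusted model output and would need a sandbox in practice.

```python
def run_candidate(source, func_name, test_inputs):
    """Run one generated program on each test input and record the outcomes.

    WARNING: exec() on model output is unsafe outside a sandbox (assumption:
    this sketch runs in an isolated environment).
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception as exc:  # syntax error or missing function
        return tuple(("error", type(exc).__name__) for _ in test_inputs)
    results = []
    for args in test_inputs:
        try:
            results.append(("ok", repr(func(*args))))
        except Exception as exc:
            results.append(("error", type(exc).__name__))
    return tuple(results)

def codes_agree(src_a, src_b, func_name, test_inputs):
    """Two programs agree when they behave identically on every test input."""
    return (run_candidate(src_a, func_name, test_inputs)
            == run_candidate(src_b, func_name, test_inputs))

a = "def add(x, y):\n    return x + y\n"
b = "def add(x, y):\n    return sum([x, y])\n"
c = "def add(x, y):\n    return x - y\n"
same = codes_agree(a, b, "add", [(1, 2), (0, 0)])
different = codes_agree(a, c, "add", [(1, 2), (0, 0)])
```

Programs `a` and `b` differ as text but behave identically on the test inputs, so a string comparison would report disagreement where execution reports agreement.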

Code Example

# Core logic: Disagreement-Guided Strategy Routing
from collections import Counter

REASON_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."
REWRITE_PROMPT = """Please remove unnecessary descriptions from the following question, \
simplify its length while keeping the original meaning unchanged, \
and retain important numbers and symbols. \
Only provide the revised question without answers or calculations."""

def extract_answer(output: str) -> str:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = output.rfind("\\boxed{")
    if start == -1:
        return output.strip()  # no boxed answer: fall back to the raw text
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(output):
        ch = output[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return output.strip()  # unbalanced braces: fall back to the raw text

def answers_equal(a1: str, a2: str) -> bool:
    """Compare two answers after normalization.

    Plain string equality only; real use needs numeric, structural,
    and symbolic equivalence checks.
    """
    return a1.strip() == a2.strip()

def disagreement_guided_inference(problem: str, model_fn, max_samples: int = 6):
    """
    model_fn: model call of the form (prompt) -> str
    max_samples: call budget; the full three-stage routing needs 6 calls
    Returns (answer, stage label, number of model calls used).
    """
    if max_samples < 6:
        raise ValueError("three-stage routing needs a budget of at least 6 calls")

    # Stage 1: Disagreement Filter (2 samples)
    out1 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    out2 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    a1, a2 = extract_answer(out1), extract_answer(out2)

    if answers_equal(a1, a2):
        # NDS: the answers agree, so return immediately
        return a1, "NDS", 2

    # Stage 2: Vote Resolve (2 more samples)
    out3 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    out4 = model_fn(f"{REASON_PROMPT}\n\n{problem}")
    a3, a4 = extract_answer(out3), extract_answer(out4)

    if answers_equal(a3, a4):
        # MDS: the second pair agrees, so take a majority vote over all four
        answers = [a.strip() for a in (a1, a2, a3, a4)]
        final = Counter(answers).most_common(1)[0][0]
        return final, "MDS", 4

    # Stage 3: Rewrite & Rethink (1 rewrite call + 1 reasoning call)
    rewritten = model_fn(f"{REWRITE_PROMPT}\n\n{problem}")
    out_rewrite = model_fn(f"{REASON_PROMPT}\n\n{rewritten}")
    final_answer = extract_answer(out_rewrite)
    return final_answer, "SDS", 6

# Usage example
# answer, stage, n_samples = disagreement_guided_inference(math_problem, my_llm_fn)
# print(f"Answer: {answer}, Stage: {stage}, Samples used: {n_samples}")
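The answer comparison in the snippet above is plain string equality, and its own comment notes that more careful matching is needed. One possible direction, sketched under stated assumptions, is to normalize simple numeric forms before comparing. The helper names `normalize_answer` and `answers_equal_normalized` are hypothetical, and the rules cover only a few common cases (simple `\frac{a}{b}`, thousands separators, whitespace); full symbolic equivalence is out of scope here.

```python
import re
from fractions import Fraction

# Matches \frac{a}{b}, \dfrac{a}{b}, \tfrac{a}{b} with integer parts only.
_FRAC = re.compile(r"^(-?)\\[dt]?frac\{(-?\d+)\}\{(-?\d+)\}$")
# Matches numbers written with thousands separators, e.g. 1,000 or 12,345.6.
_THOUSANDS = re.compile(r"^-?\d{1,3}(,\d{3})+(\.\d+)?$")

def normalize_answer(ans):
    """Return a Fraction for simple numeric answers, else the cleaned string."""
    text = ans.strip().strip("$").replace(" ", "")
    m = _FRAC.match(text)
    if m:
        text = f"{m.group(1)}{m.group(2)}/{m.group(3)}"
    elif _THOUSANDS.match(text):
        text = text.replace(",", "")
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return text  # not a plain number: compare as text

def answers_equal_normalized(a1, a2):
    """Equality after the limited normalization above (illustrative only)."""
    return normalize_answer(a1) == normalize_answer(a2)

half_matches = answers_equal_normalized("\\frac{1}{2}", "0.5")
thousand_matches = answers_equal_normalized("1,000", "1000")
tuple_differs = answers_equal_normalized("(1,2)", "(12)")
```

Stricter matching lowers the rate of false disagreement, which matters because every spurious mismatch pushes a problem into the more expensive later stages.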

Terminology

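Test-Time Scaling: Raising accuracy by spending more computation at inference time, without retraining the model. Similar to thinking longer during an exam.

Majority Voting: Solving the same problem several times and adopting the most frequent answer as the final answer. Like settling a question by poll.

LRM: Large Reasoning Model. A large language model specialized for tasks that require logical reasoning, such as math or coding. The DeepSeek-R1 and Qwen3 series belong here.

NDS/MDS/SDS: No/Minor/Severe Disagreement Samples. Problems split into three groups by how much the model's outputs disagree. NDS are easy problems; SDS are hard problems on which the model keeps giving different answers.

Rewriting: Changing how a problem is expressed while keeping its meaning. Describing the same math problem in other words can lead the model down a different reasoning path and out of an earlier error.

Best-of-N (BoN): Sampling N times and letting an external reward model pick the best answer. Like appointing a separate judge.

Output Disagreement: The degree to which a model's answers differ across repeated runs on the same problem. Greater disagreement signals that the model finds the problem hard.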

Original Abstract

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.