OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation | AI Paper Digest

TL;DR Highlight

A framework that compares and evolves multiple LLM answers tournament-style, raising competitive programming Elo by +405 without an external verifier.

Who Should Read

ML engineers and AI researchers who want to improve LLM reasoning performance, especially anyone tackling the "which answer is best?" selection problem in Best-of-N sampling.

Core Mechanics

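Existing reasoning scaling mostly extends a single chain (depth); OpenDeepThink instead takes a breadth approach, generating many candidates in parallel and evolving them.
The core idea is to rank candidates with the Bradley–Terry model (a statistical method that turns many head-to-head comparison results into a global ranking), so the best answer can be selected without an external ground-truth verifier.
Having the LLM compare two answers directly (pairwise) is far more accurate than scoring each one individually (pointwise): in a 500-pair diagnostic test, pairwise reached 86% accuracy versus 59% for pointwise.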
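Each generation runs an evolutionary loop: the bottom 25% are discarded, the top 25% are preserved unchanged as elites, and the top 75% (all survivors, elites included) are mutated using the natural-language feedback produced during comparison.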
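What matters in mutation is negative feedback about what went wrong. Giving only positive feedback performs about the same as giving no feedback, while negative feedback drives the real improvement.
On the HLE benchmark (hard multi-domain questions), performance rises in objective domains such as math and physics but actually falls in subjective domains such as the humanities and social sciences; the reliability of pairwise judgment determines how well the framework works.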
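
The generation step can be sketched in a few lines, following the abstract's description (top-ranked candidates preserved, top three quarters mutated, bottom quarter discarded). This is a minimal illustration, not the authors' released code: the function name `next_generation`, the `mutate` callback, and the choice to keep the population size constant by cycling through survivors best-first are all assumptions made for this example. Bradley–Terry scores are taken as given.

```python
from typing import Callable, List, Sequence


def next_generation(
    population: Sequence[str],
    scores: Sequence[float],
    mutate: Callable[[str], str],
    elite_frac: float = 0.25,
    cull_frac: float = 0.25,
) -> List[str]:
    """Build the next generation from a scored population (illustrative sketch)."""
    n = len(population)
    # Rank candidate indices from best to worst by score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_elite = max(1, int(n * elite_frac))
    n_keep = n - int(n * cull_frac)  # everything except the bottom fraction
    survivors = order[:n_keep]
    # Elites are copied into the next generation unchanged.
    elites = [population[i] for i in order[:n_elite]]
    # Assumption: refill to the original size by mutating survivors best-first.
    children: List[str] = []
    k = 0
    while len(elites) + len(children) < n:
        parent = population[survivors[k % len(survivors)]]
        children.append(mutate(parent))
        k += 1
    return elites + children
```

With these default fractions and a population size divisible by four, each survivor is mutated exactly once: for eight candidates, two elites plus six mutated children restore a population of eight. In a real pipeline `mutate` would call the LLM with the mutation prompt and the collected feedback.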

Evidence

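With Gemini 3.1 Pro, Codeforces Elo improves by +405 over plain random sampling (2851 → 3256), comparable to the +411 that Gemini 3 Deep Think achieves over Gemini 3.1 Pro.
On the Hard problems (64 of them), pass@1 rises from 11% to 36% and BT top-1 rises from 23% to 50%. The 95% confidence interval for the improvement is +16 to +39pp.
Pairwise judgment accuracy is 86.2% versus 59.2% for pointwise (500-pair diagnostic based on NOI-119 problems). Pointwise identifies correct (AC) solutions 96.4% of the time but identifies wrong (WA) solutions only 62.2% of the time, so its selection quality is low.
Compared with 6 rounds of Self-Refine plus 8 pointwise votes (300 API calls in total), OpenDeepThink (285 calls) reaches 50% top-1 accuracy on Hard problems versus 41%, a 9pp advantage. At a similar budget, this is a pure selection signal that comes from the difference in selection mechanism alone.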

How to Apply

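If you use Best-of-N sampling but struggle to pick the best answer, try having the LLM compare candidate pairs, derive a ranking with Bradley–Terry, and submit the top-1. Even without an external verifier, this can select much better than simple majority voting.
In a code generation or math problem-solving pipeline, you can generate multiple candidates and then implement the evolutionary loop using the paper's prompts (the pairwise comparison and mutation templates) as-is. The code is released on GitHub, so it can be applied right away.
In domains that require subjective judgment (creative writing, summarization, and so on), this method can actually hurt performance. Apply it only to domains with objectively correct answers (coding, math, science), and verify pairwise judgment accuracy before applying it.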
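
One way to run the pre-check on pairwise judgment accuracy is to score the judge on a small set of pairs whose true verdict is already known. The sketch below is an assumption-laden illustration: `judge` is a hypothetical callable that wraps your LLM comparison call and returns "A", "B", or "TIE", and counting a TIE as a miss is a choice made here, not something the paper specifies.

```python
from typing import Callable, Iterable, Tuple

# (solution_a, solution_b, truth), where truth is "A" or "B":
# the side known to be better, e.g. from an online judge verdict.
LabeledPair = Tuple[str, str, str]


def pairwise_accuracy(
    pairs: Iterable[LabeledPair],
    judge: Callable[[str, str], str],
) -> float:
    """Fraction of labeled pairs on which the judge picks the known-better side."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one labeled pair")
    # A "TIE" answer never equals "A" or "B", so it counts as a miss here.
    hits = sum(1 for a, b, truth in pairs if judge(a, b) == truth)
    return hits / len(pairs)
```

If the measured accuracy on your domain is close to chance, the ranking built from those judgments is unlikely to help, which matches the reversal the digest reports for subjective domains.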

Code Example

# Pairwise comparison prompt template provided in the paper
comparison_prompt = """
You are a competitive programming expert.

## Problem Statement
{problem}

## Solution A
```cpp
{code_a}
```

## Solution B
```cpp
{code_b}
```

Which solution is more likely to receive an Accepted verdict from an online judge ---
meaning it produces correct output within the time and memory limits for all valid inputs?

If both solutions appear incorrect (wrong answer, TLE, or other issues), choose the
one that requires fewer modifications to become Accepted.

If they are fundamentally identical or equally likely to be Accepted, output TIE.

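Respond with a JSON object and nothing else, in exactly this format:
{{
  "feedback_a": "one sentence on Solution A's key strength or critical flaw",
  "feedback_b": "one sentence on Solution B's key strength or critical flaw",
  "winner": "A or B or TIE"
}}
"""
# Note: the literal JSON braces above are doubled so that
# comparison_prompt.format(problem=..., code_a=..., code_b=...) works;
# str.format renders {{ and }} as single braces in the final prompt.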

# Mutation prompt (includes negative feedback)
mutation_prompt = """
## Problem
{problem}

## Solution
```cpp
{code}
```

## Pairwise Feedback
This solution was compared against other solutions multiple times:

### Wins (this solution was judged better):
- {wins_feedback}

### Losses (this solution was judged worse):
- {losses_feedback}

## Task
Write a solution that maximizes the probability of Accepted.
You may refine the existing solution or take a different approach
if the current one is fundamentally flawed.

Think briefly, then output your final solution as a single ```cpp ... ``` block.
"""

# Bradley-Terry score computation (using scipy)
from scipy.optimize import minimize
import numpy as np

def fit_bradley_terry(comparisons, n_candidates, lambda_reg=0.01):
    """
    comparisons: list of (i, j, outcome) where outcome=1 if i wins, 0 if j wins, 0.5 if tie
    n_candidates: number of candidates
    returns: score vector s of shape (n_candidates,)
    """
    def neg_log_likelihood(s):
        loss = 0.0
        for i, j, outcome in comparisons:
            diff = s[i] - s[j]
            # Log-likelihood of the outcome under the logistic model.
            # np.logaddexp(0, diff) equals log(1 + exp(diff)) but does not
            # overflow when diff is large.
            log_prob = outcome * diff - np.logaddexp(0.0, diff)
            loss -= log_prob
        loss += 0.5 * lambda_reg * np.sum(s**2)  # L2 regularization
        return loss

    s0 = np.zeros(n_candidates)
    result = minimize(neg_log_likelihood, s0, method='L-BFGS-B')
    return result.x

# Usage example
# comparisons = [(0, 1, 1), (1, 2, 0.5), (0, 2, 1)]  # (candidate i, candidate j, outcome)
# scores = fit_bradley_terry(comparisons, n_candidates=3)
# best_candidate_idx = np.argmax(scores)
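
To connect the comparison prompt to the ranking step, each judge reply has to be turned into an `(i, j, outcome)` tuple, and its feedback sentences filed under wins or losses for the mutation prompt. The following is a rough sketch under stated assumptions: `record_verdict` is a hypothetical helper, the reply is assumed to be exactly the JSON object the prompt asks for, and skipping feedback on a TIE is a choice made for this example rather than something the source describes.

```python
import json
from collections import defaultdict

# Map the judge's "winner" field to the outcome convention used by the
# Bradley-Terry fit: 1 if the first candidate wins, 0 if the second, 0.5 for a tie.
OUTCOME = {"A": 1.0, "B": 0.0, "TIE": 0.5}


def record_verdict(i, j, raw_response, comparisons, wins, losses):
    """Parse one judge reply for candidates (i, j) and update the running records."""
    verdict = json.loads(raw_response)
    winner = verdict["winner"].strip().upper()
    if winner not in OUTCOME:
        raise ValueError(f"unexpected winner value: {winner!r}")
    comparisons.append((i, j, OUTCOME[winner]))
    if winner == "A":
        wins[i].append(verdict["feedback_a"])
        losses[j].append(verdict["feedback_b"])
    elif winner == "B":
        wins[j].append(verdict["feedback_b"])
        losses[i].append(verdict["feedback_a"])
    # On a TIE neither feedback list is updated (an assumption of this sketch).


comparisons = []
wins, losses = defaultdict(list), defaultdict(list)
reply = '{"feedback_a": "handles n=1", "feedback_b": "overflows int", "winner": "A"}'
record_verdict(0, 1, reply, comparisons, wins, losses)
```

After enough pairs have been judged, `comparisons` can be passed to the Bradley-Terry fit, and `wins[k]` and `losses[k]` can be joined into the wins and losses slots of the mutation prompt for candidate `k`.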

Terminology

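Bradley–Terry model: A technique that statistically estimates an overall ranking from the results of many head-to-head matchups. It works on the same principle as ranking teams in a sports league purely from match results.

pairwise comparison: Judging which of two candidates is better by comparing them directly. Instead of asking "what score does this answer get?", it asks "which is better, this answer or that one?".

pointwise scoring: Evaluating each candidate independently with an absolute score. The LLM looks at answers one at a time and assigns something like "75 out of 100", which suffers from a strong positive bias (overrating answers as good).

test-time compute scaling: A strategy that improves performance by spending more computation at inference (answer generation) time, without making the model larger or retraining it. o1 and DeepSeek-R1 are representative examples.

Best-of-N sampling: Generating N answers to the same problem and picking the best one. The difficulty is how to decide which one is "the best".

elite preservation: A strategy in evolutionary algorithms that lets the top candidates survive unchanged into the next generation, so that a good solution is not accidentally mutated into a worse one.

Codeforces Elo: The rating system used by Codeforces, a competitive programming platform. Like chess Elo, it is computed from problem difficulty and solve results; a higher rating means stronger skill.

HLE (Humanity's Last Exam): An LLM benchmark made up of extremely difficult questions from many fields, including math, science, and the humanities, that are hard even for human experts.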

Original Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.