Teaching LLMs to Critique with Reinforcement Learning: The CTRL Framework

Teaching Language Models to Critique via Reinforcement Learning

Feb 5, 2025 • Zhihui Xie, Jie Chen, Liyu Chen +3

TL;DR Highlight

A dedicated critic model trained with RL gives better code-revision feedback than GPT-4o.

Who Should Read

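ML engineers and AI product developers building automated feedback-and-revision loops into LLM-based code generation pipelines, especially those looking to raise performance by spending more test-time compute.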

Core Mechanics

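Decouple the generator from the critic and train only the critic with RL. A critic built on Qwen2.5-Coder-32B still improves results when paired with a stronger generator such as GPT-4o (weak-to-strong generalization).
Two-stage training: first SFT on critiques that use code execution results (from a sandbox) as hints, then RL with GRPO (Group Relative Policy Optimization). PPO was dropped because its value network was unstable.
Self-critique barely helps (Pass@1 7.88% → 8.36%). Under the same conditions CTRL reaches 11.76% while keeping the rate at which it breaks previously correct answers (∆↓) low, at 0.85%.
Iterative critique-revision enables test-time scaling: five rounds reach 16.24% Pass@1 on CodeContests, a 106.1% relative improvement over the 7.88% zero-shot baseline.
Although trained only on coding problems, the CTRL critic reaches accuracy on par with Claude-3.5-Sonnet (64.3%) on JudgeBench, which also covers general knowledge, math, and reasoning.
Self-critique tends to make only small changes to the original code (mean similarity 0.482), whereas CTRL makes larger structural changes (mean similarity 0.313); this is the source of the performance gap.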
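
The iterative critique-revision loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate`, `critique`, and `revise` are hypothetical callables standing in for the generator and critic models, and stopping early once the critic judges the answer correct is an assumption of this sketch.

```python
from typing import Callable


def critique_revise_loop(
    problem: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 5,
) -> str:
    """Refine a solution for up to max_rounds critique-revision rounds."""
    solution = generate(problem)  # zero-shot attempt
    for _ in range(max_rounds):
        feedback = critique(problem, solution)
        # Assumed stopping rule: trust the critic's verdict line.
        if "Overall judgment: Correct" in feedback:
            break
        solution = revise(problem, solution, feedback)
    return solution
```

Keeping the three roles as separate callables mirrors the generator/critic split: the critic can be swapped without touching the generator.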

Evidence

CodeContests, five rounds: 16.24% Pass@1, a 106.1% relative improvement over the 7.88% zero-shot baseline.
With GPT-4o as the generator, the CTRL critic still achieves higher Pass@1 than a GPT-4o critic (25.45% vs. 20.61%, at five rounds).
JudgeBench accuracy, coding category: CTRL 65.7%, Claude-3.5-Sonnet 64.3%, GPT-4o 56.6%.
On Hard problems, six rounds improve the pass rate by 233.3% (Easy +73.2%, Medium +161.1%).
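
The relative-improvement figure follows directly from the two CodeContests Pass@1 values quoted above. A small check, where `relative_improvement` is a helper name introduced here for illustration:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of improved over baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Pass@1 on CodeContests: 7.88% zero-shot vs. 16.24% after five rounds
gain = relative_improvement(7.88, 16.24)
print(f"{gain:.1f}%")  # 106.1%
```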

How to Apply

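In an automated code review pipeline, add a separate critic model and have it produce feedback in three parts (analysis, improvement suggestions, correct/incorrect verdict). This steers revisions far more actionably than raw execution error messages.
If you have a sandbox (a code execution environment), feed execution results in as hints to generate SFT data automatically, then train with GRPO. This builds a critic model with no human labels.
At inference time, sample the critic several times and aggregate the 'Correct/Incorrect' verdicts by majority vote to use it like a reward model (accuracy was observed to improve as the number of votes grows).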
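
One way to turn sandbox results into a hint for SFT data generation, as the second step suggests, is sketched below. Everything here is illustrative: `CaseResult`, `build_hint`, and the hint wording are assumptions of this sketch rather than the paper's data format, and the sandbox run itself is assumed to have happened already.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one test case from an (assumed) sandbox run."""
    test_input: str
    expected: str
    actual: str

    @property
    def passed(self) -> bool:
        # Compare outputs ignoring leading/trailing whitespace.
        return self.expected.strip() == self.actual.strip()


def build_hint(results: list[CaseResult]) -> str:
    """Summarize execution feedback as a hint string for the critic."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return "All test cases passed."
    first = failures[0]  # report one failing case to keep the hint short
    return (
        f"{len(failures)} of {len(results)} test cases failed. "
        f"Input: {first.test_input!r}; expected: {first.expected!r}; "
        f"actual: {first.actual!r}"
    )
```

The resulting string could fill the hint slot of a hinted critique prompt; whether any failure occurred also gives a label for the critic's final verdict.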

Code Example

# CTRL-style critique prompt template (based on Appendix C.2 of the paper)

CRITIQUE_PROMPT = """
You are tasked with analyzing an answer to a problem and providing constructive feedback.
Do NOT provide direct solutions.

Problem description:
<problem>
{problem}
</problem>

Answer:
<answer>
{answer}
</answer>

Structure your response using the following format:

Analysis:
{{Analysis of strengths and weaknesses}}

Improvement suggestions:
{{Actionable suggestions for improvement}}

Overall judgment: {{Correct/Incorrect}}
"""

# Hinted variant for when execution feedback is available.
# {hint} carries e.g. the input/expected/actual of a failed test case.
# Note: this template takes the candidate code as {solution}, not {answer}.
HINTED_CRITIQUE_PROMPT = """
You are tasked with analyzing an answer to a problem and providing constructive feedback.
Do NOT provide direct solutions.
Please carefully reason about the hint to guide the user.
**Important: Do NOT mention 'the hint' in your feedback.**

Problem description:
<problem>
{problem}
</problem>

Answer:
<answer>
{solution}
</answer>

Hint:
<hint>
{hint}
</hint>

Analysis:
...
Improvement suggestions:
...
Overall judgment: Correct/Incorrect
"""

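# Aggregate critic verdicts by majority vote
import re
from collections import Counter

# Matches the final verdict line required by the prompt format above.
_JUDGMENT_RE = re.compile(r"Overall judgment:\s*(Correct|Incorrect)")

def aggregate_judgments(critiques: list[str]) -> str:
    """Return the majority verdict ('Correct' or 'Incorrect') across critiques."""
    judgments = []
    for c in critiques:
        match = _JUDGMENT_RE.search(c)
        if match:  # critiques without a parsable verdict are skipped
            judgments.append(match.group(1))
    if not judgments:
        raise ValueError("no critique contained an 'Overall judgment' line")
    # On a tie, most_common keeps first-seen order, so the earlier verdict wins.
    return Counter(judgments).most_common(1)[0][0]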
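
For the RL stage, the core idea of GRPO is to score each sampled response relative to the others drawn for the same problem. The sketch below shows only that group-relative normalization. Using the population standard deviation and assigning zero advantage to uniform groups are choices made for this sketch, the binary reward in the example is assumed, and the rest of the objective (ratio clipping, KL term) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:  # all samples scored alike: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Example: four critiques for one problem; assumed reward is 1.0 when the
# code revised under that critique passes the tests, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is the group mean, no separate value network is needed, which is the property the summary credits for GRPO's stability.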

Terminology

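GRPO: Group Relative Policy Optimization. An RL training method that samples several responses to the same problem and learns by comparing them against each other. Unlike PPO, it needs no separate value network, which makes it more stable.

PPO: Proximal Policy Optimization. One of the most widely used RL algorithms and a common choice in RLHF. It requires a value network, which can make it unstable on complex tasks.

SFT: Supervised Fine-Tuning. Training a model to imitate worked examples. Used before RL to teach the model the output format and general direction first.

Pass@1: The fraction of problems whose tests pass on a single generated attempt. A code generation quality metric; higher is better.

Test-time Scaling: A strategy that improves performance by spending more compute at inference (iterative revision, sampling multiple candidates, and so on) without changing model parameters. No retraining is needed.

Weak-to-strong Generalization: The phenomenon in which a smaller or weaker model can effectively supervise and guide a larger, stronger one. A concept introduced by OpenAI and an important topic in AI alignment research.

Compounding Error: The tendency for early mistakes to accumulate over repeated revisions so that results get progressively worse. It grows as the critic repeatedly steers correct answers into incorrect ones.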
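
To make compounding error measurable, one can count how often a revision flips an answer from correct to incorrect (the ∆↓ rate mentioned earlier) and the reverse. A minimal sketch, assuming per-problem pass/fail flags before and after revision and rates taken over all problems; `flip_rates` is a name introduced here, and the paper's exact definition may differ.

```python
def flip_rates(before: list[bool], after: list[bool]) -> tuple[float, float]:
    """Return (fixed, broken) as percentages of all problems.

    fixed:  incorrect before revision, correct after
    broken: correct before revision, incorrect after
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of equal length")
    n = len(before)
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    return 100.0 * fixed / n, 100.0 * broken / n
```

Tracking the broken rate per round shows whether extra critique-revision rounds are still helping or have started to undo earlier correct answers.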

Related Resources

https://critic-rl.github.io

Original Abstract

Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.