The Role of Feedback Alignment in Self-Distillation

TL;DR Highlight

When an LLM teaches itself, aligning the feedback step by step with the model's own reasoning trace improves math reasoning by 16.11 points over GRPO (Avg@12).

Who Should Read

ML engineers working to improve LLM math reasoning, and fine-tuning developers looking for a better training signal than RLHF/GRPO.

Core Mechanics

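Self-distillation (distilling a model with itself as the teacher) trains the same model in two roles: a student that sees only the question, and a teacher that sees the question plus additional context. The key question is what context to give the teacher.
The paper compares three feedback conditions: GRPO, which uses only a binary reward; RefSol, which gives the reference solution as context; and StepAlignFB, which gives step-by-step feedback aligned to the student's reasoning steps.
In StepAlignFB, Qwen/QwQ-32B (a frozen critic) generates feedback as a 'faithful scribe': it copies the correct steps of the student's solution as they are and rewrites only the incorrect steps.
With the reference solution (RefSol) as context, the wording differs even on steps the student solved correctly, so negative gradient spreads across almost every token. Only the faulty part needed fixing, yet the student is pushed to change its entire way of solving the problem.
StepAlignFB concentrates the training signal near the erroneous tokens, much like a PRM (Process Reward Model, a model that assigns a reward to each reasoning step). It achieves this without training a separate PRM and without step-level labels.
Copying correct steps verbatim activates induction heads (attention mechanisms that repeat patterns seen in context), which reinforces the correct reasoning. Incorrect steps are not copied, so the critic's correction carries through directly.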
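The faithful-scribe output shape can be illustrated with a toy function. This is only a sketch under stated assumptions: in the paper the feedback is written by an LLM critic through a prompt, whereas here the per-step verdicts are supplied by hand. The function name `scribe_feedback` and the `corrections` mapping are hypothetical, and the toy does not rewrite the steps that follow an error.

```python
def scribe_feedback(student_steps, corrections):
    """Build step-aligned feedback from hand-supplied verdicts (toy stand-in for the critic).

    corrections maps a 1-based step index to (short reason, corrected step text).
    """
    summary, body = [], []
    for i, step in enumerate(student_steps, start=1):
        if i in corrections:
            reason, fixed = corrections[i]
            summary.append(f"Step {i}: Incorrect -- {reason}")
            body.append(fixed)   # only the erroneous step is replaced
        else:
            summary.append(f"Step {i}: Correct")
            body.append(step)    # correct steps are copied verbatim
    return "\n".join(summary + body)


# Made-up example, not taken from the paper.
steps = ["Step 1: 2 + 3 = 5", "Step 2: 5 * 2 = 12"]
corrections = {2: ("arithmetic error", "Step 2: 5 * 2 = 10")}
feedback = scribe_feedback(steps, corrections)
print(feedback)
```

Because the correct step reappears character for character, the teacher and student distributions should differ mainly around the replaced step.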

Evidence

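On Avg@12, StepAlignFB beats GRPO by 16.11 points and RefSol by 5.27 points (35.83 vs 19.72 and 35.83 vs 30.56, respectively).
On Majority-Vote@12, StepAlignFB scores 56.67 against 43.33 for RefSol and 26.67 for GRPO, which indicates that probability mass concentrates more on the correct answer.
On Pass@12 (the probability of being correct at least once in 12 attempts), StepAlignFB is highest at 90.00, ahead of RefSol at 86.67 and GRPO at 76.67.
Experiments used a Qwen3-1.7B solver, the OpenMathReasoning dataset (282 train / 30 test problems after difficulty filtering), 4×H100 GPUs, and up to 7 epochs of training.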
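For readers who want to compute the same three metrics on their own runs, here is a minimal sketch. It assumes the usual definitions (Avg@k as mean per-sample accuracy, Majority-Vote@k as accuracy of the most frequent answer, Pass@k as at least one correct sample); the paper's exact evaluation code is not shown in this summary, and the function names and toy data are illustrative only.

```python
from collections import Counter


def avg_at_k(samples, gold):
    """Mean fraction of the k sampled answers that are correct, averaged over problems."""
    per_problem = [
        sum(a == g for a in answers) / len(answers)
        for answers, g in zip(samples, gold)
    ]
    return 100.0 * sum(per_problem) / len(per_problem)


def majority_vote_at_k(samples, gold):
    """Percentage of problems whose most frequent sampled answer is correct."""
    hits = sum(
        Counter(answers).most_common(1)[0][0] == g
        for answers, g in zip(samples, gold)
    )
    return 100.0 * hits / len(gold)


def pass_at_k(samples, gold):
    """Percentage of problems with at least one correct answer among the k samples."""
    hits = sum(g in answers for answers, g in zip(samples, gold))
    return 100.0 * hits / len(gold)


# Toy data: 2 problems, 4 samples each (made-up answers, not from the paper).
samples = [["7", "7", "3", "7"], ["1", "2", "2", "5"]]
gold = ["7", "5"]
print(avg_at_k(samples, gold), majority_vote_at_k(samples, gold), pass_at_k(samples, gold))
```

A gap between Pass@k and Majority-Vote@k, as in the second toy problem, means the model can reach the right answer but does not put most of its probability mass on it.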

How to Apply

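If your math or coding reasoning fine-tuning pipeline uses GRPO, you can add a strong model (for example QwQ-32B) as a frozen critic, have it generate step-by-step feedback on the student's solution, and pass that feedback as the self-distillation context. The key is a critic prompt that copies correct steps as they are and fixes only the incorrect ones.
Concretely, design the critic prompt around the faithful-scribe rule: 'copy correct steps verbatim in the student's own wording, and fix incorrect steps while keeping the student's style'. The full critic prompt template is published in Appendix A.1 of the paper, so the setup can be reproduced directly.
If your dataset has only reference solutions, do not use the reference solution directly as the teacher context. Instead, add an intermediate step in which a strong model compares the student's solution with the reference solution and generates a step-by-step critique; this performed better in the paper's experiments.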
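The data flow described in these steps can be sketched as follows. This is a hedged illustration, not the paper's pipeline: the two templates are shortened stand-ins, `build_prompts` is a hypothetical helper, and `critic` is any callable that maps a prompt string to a critique string (in practice a call to a frozen model such as QwQ-32B).

```python
from typing import Callable, Tuple

# Shortened stand-in templates (assumption: the real ones are much longer).
CRITIC_TEMPLATE = (
    "Problem: {problem}\n"
    "Reference answer: {reference_solution}\n"
    "Student's solution: {student_sol}\n"
)
TEACHER_TEMPLATE = (
    "Question: {problem}\n\n"
    "Expert feedback:\n{expert_critique}\n"
)


def build_prompts(
    problem: str,
    student_sol: str,
    reference_solution: str,
    critic: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (student_prompt, teacher_prompt) for one self-distillation example."""
    critique = critic(
        CRITIC_TEMPLATE.format(
            problem=problem,
            reference_solution=reference_solution,
            student_sol=student_sol,
        )
    )
    student_prompt = f"Question: {problem}\n"  # the student sees only the question
    teacher_prompt = TEACHER_TEMPLATE.format(problem=problem, expert_critique=critique)
    return student_prompt, teacher_prompt


def fake_critic(prompt: str) -> str:
    """Placeholder critic that returns a fixed critique, for demonstration only."""
    return "Step 1: Correct\nStep 2: Incorrect -- sign error"


student_prompt, teacher_prompt = build_prompts(
    "Solve x + 2 = 5.", "Step 1: ...", "3", fake_critic
)
print(teacher_prompt)
```

Keeping the critic behind a plain callable makes it easy to swap the placeholder for a real model call without touching the prompt assembly.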

Code Example


# Core structure of the StepAlignFB critic prompt (based on Appendix A.1)

CRITIC_PROMPT = """
You are a math grader producing feedback to a student's solution.
Your default behavior is faithful scribe: when the student's work is correct up to
some point, reproduce that portion. The exceptions:
- In Case D, at the erroneous step only, replace the student's claim with the correct one.
- In Case C, you may compress the student's correct steps because the student ran out of room.

# Decision procedure
Step 1: Classify into ONE case.
- No final answer -> Case C
- Boxed answer doesn't match reference -> Case D
- Boxed answer matches but a non-routine step is unjustified/invalid -> Case B
- Else -> Case A

Step 2: Identify the pivotal step N.
Step 3: Output the matching schema. No preamble, no postamble.

# Hard rules
1. Output ONE schema only.
2. Summary block: one line per student step 'Step <i>: Correct' or 'Step <i>: Incorrect -- <phrase>'.
3. In Cases A, B, D: reproduce correct steps VERBATIM (same notation, same wording).
4. In Case D, at step N ONLY: replace incorrect claim with correct claim in student's style.

Problem: {problem}
Reference answer: {reference_solution}
Student's solution: {student_sol}
"""

# Teacher prompt (StepAlignFB).
# The braces after \boxed are doubled so that str.format() leaves them literal.
TEACHER_PROMPT = """
Question: {problem}

Expert feedback on a prior attempt at this problem is given below.
The feedback diagnoses where the attempt went wrong (by step number)
and carries the corrected continuation.

Expert feedback:
{expert_critique}

Instructions:
- Produce a fresh, self-contained solution to the original problem.
- Use the feedback only to ensure correctness; do not mention or refer to it.
Let's think step by step and produce a final answer in the format \\boxed{{}}.
"""

# Self-distillation advantage (core formula)
# A_SD_t = log π(y_t | x, context, y_<t) - log π(y_t | x, y_<t)
# With step-aligned feedback as the context, A_SD_t is strongly negative only near the erroneous tokens.
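The formula above can be sketched in a few lines. The sketch assumes you already have per-token log-probabilities of the student's sampled tokens under both the teacher (with context) and the student (without); the function name, the threshold of -1.0, and the numbers are illustrative assumptions, not values from the paper.

```python
def per_token_advantage(teacher_logprobs, student_logprobs):
    """A_t = log pi(y_t | x, context, y_<t) - log pi(y_t | x, y_<t) for each token t."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must align token by token")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Toy values: the teacher agrees with the student everywhere except the wrong final token.
tokens = ["2", "+", "3", "=", "6"]
student_lp = [-0.1, -0.1, -0.1, -0.1, -0.5]
teacher_lp = [-0.1, -0.1, -0.1, -0.1, -4.0]

adv = per_token_advantage(teacher_lp, student_lp)
flagged = [tok for tok, a in zip(tokens, adv) if a < -1.0]  # threshold is arbitrary
print(adv, flagged)
```

Under a reference-solution context, one would instead expect many tokens to fall below the threshold, which is the diffuse pressure the summary describes.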

Terminology

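Self-distillation: A method in which a model learns with itself as the teacher, without an external teacher model. The same model is split into a student without context and a teacher with context, and it is trained so that the two output distributions match.

GRPO: Group Relative Policy Optimization. A reinforcement learning method that samples several answers per question, scores each one (+1 if correct, -1 if wrong), and computes advantages relative to the group. It knows only whether an answer is right or wrong, not where it went wrong.

PRM: Process Reward Model. A model that evaluates whether each step of the reasoning is correct, rather than whether the final answer is correct. Scoring each step gives a more precise training signal.

induction head: One of the attention patterns inside a transformer. When a token sequence seen earlier appears again, it predicts with high probability the token that followed it before. It is the mechanism behind copying and repetition.

per-token advantage: In reinforcement learning, a value that indicates how good or bad each token choice was. Outcome-based RL such as GRPO assigns one value to the whole response, whereas self-distillation assigns a different value to each token.

KL divergence: A measure of how different two probability distributions are. In self-distillation it is used to reduce the gap between the student distribution and the teacher distribution.

LoRA: A fine-tuning technique that trains small low-rank adapter matrices instead of updating the whole model. It adapts a model to a specific task with far less memory and time.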

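forward KL: The KL divergence measured from the teacher to the student. It penalizes the student for assigning low probability where the teacher assigns high probability, so the student tries to cover every answer mode the teacher prefers.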

Original Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.