Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Mar 19, 2026 • Zhuolin Yang, Zihan Liu, Yang Chen +14

TL;DR Highlight

NVIDIA releases the training recipe for an open-source, reasoning-focused 30B MoE model that reached gold-medal level at IMO and IOI 2025.

Who Should Read

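ML engineers who design RL-based LLM post-training pipelines or fine-tune models specialized for math and coding. AI researchers who want to apply the training strategy of an open-source reasoning model in practice.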

Core Mechanics

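Cascade RL: applies RL sequentially, one domain at a time, in the order IF-RL → Multi-domain RL → MOPD → RLHF → Long-context RL → Code RL → SWE RL, to minimize catastrophic forgetting.
Multi-domain On-Policy Distillation (MOPD): uses intermediate Cascade RL checkpoints as per-domain teachers and distills their knowledge into the student model. It reaches higher scores than GRPO at a comparable number of steps (AIME25: 91.0 for GRPO at 25 steps vs. 92.0 for MOPD at 30 steps).
The 30B-A3B MoE model (3B active parameters) is the second open-weight model, after DeepSeek-V3.2-Speciale-671B-A37B, to reach gold-medal level at both IMO and IOI 2025, with comparable performance at roughly 20x fewer parameters.
The KL divergence term is removed entirely and training uses strictly on-policy GRPO (a REINFORCE objective), which improves training stability and prevents entropy collapse.
SFT data spans a wide range of domains, including math (1.8M + 2.6M samples), code (1.9M Python + 1.0M C++), and science (2.7M), generated with models such as DeepSeek-V3.2-Speciale and GPT-OSS-120B.
Agentless RL also helps agentic tasks: with mixed agentless + agentic SFT, OpenHands Pass@1 rises from 48.9 to 49.9 and Pass@4 from 62.8 to 65.2.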
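
The KL-free, on-policy GRPO objective mentioned above can be illustrated with a small numeric sketch. It assumes verifier rewards per rollout, mean/std normalization within each prompt's rollout group, and a sequence-level REINFORCE surrogate; the function names and the `eps` constant are illustrative choices, not taken from the paper, and a real trainer would compute the same quantity on differentiable tensors.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group (GRPO-style).

    Assumption: mean/std normalization; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reinforce_loss(logprobs, advantages):
    """REINFORCE surrogate with no KL penalty term.

    logprobs: one list of per-token log-probs for each rollout.
    advantages: one scalar advantage per rollout.
    """
    per_rollout = [-(adv * sum(lp)) for lp, adv in zip(logprobs, advantages)]
    return sum(per_rollout) / len(per_rollout)

# Hypothetical group of 4 rollouts for one prompt: two pass the verifier, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
logprobs = [[-0.2, -0.1], [-0.9, -0.4], [-0.7, -0.3], [-0.1, -0.3]]
print(advantages)
print(reinforce_loss(logprobs, advantages))
```

Note that a group whose rollouts all receive the same reward yields zero advantages under this normalization, so it contributes no gradient.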

Evidence

LiveCodeBench v6: 87.2% (88.4% with TIR), well ahead of Qwen3.5-35B-A3B (74.6%) and Nemotron-3-Super-120B-A12B (78.7%).
ArenaHard v2: in just 52 MOPD steps, Hard Prompt rises from 71.5 to 85.5 and Creative Writing from 40.6 to 71.0; RLHF needs 160 steps to reach 80.7 / 71.2.
IMO 2025: 5 problems solved (35/42 points), gold medal. IOI 2025: 439.28/600 points, gold medal. ICPC World Finals 2025: 10/12 problems, gold medal.
IMO-Proof Bench: 72.9 points, 7.3 points behind DeepSeek-Math-V2-671B-A37B (80.2), which has more than 10x the active parameters.
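
As a quick sanity check on the size and score comparisons quoted in this summary, the ratios can be recomputed directly. The only assumption is that the model names encode total and active parameter counts in billions (for example, 671B-A37B and 30B-A3B).

```python
# Figures quoted in this summary; model names are read as total / active parameters.
total_params_ratio = 671 / 30    # DeepSeek-V3.2-Speciale-671B vs. the 30B model
active_params_ratio = 37 / 3     # A37B vs. A3B
proof_bench_gap = 80.2 - 72.9    # DeepSeek-Math-V2 minus Nemotron-Cascade 2
steps_ratio = 160 / 52           # RLHF steps vs. MOPD steps on ArenaHard v2

print(f"total parameter ratio:  {total_params_ratio:.1f}x")
print(f"active parameter ratio: {active_params_ratio:.1f}x")
print(f"IMO-Proof Bench gap:    {proof_bench_gap:.1f} points")
print(f"RLHF / MOPD step ratio: {steps_ratio:.1f}x")
```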

How to Apply

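Design the domain order carefully for multi-domain RL: separate tasks that interfere strongly with each other (IF vs. RLHF), and group tasks with similar response lengths and verification times (MCQA + tool calling + structured output) so they can be trained efficiently in a single multi-domain RL stage.
If performance regresses during RL training, pick intermediate Cascade RL checkpoints as per-domain teachers and apply MOPD, with no external model required; recovery is fast, converging within 40 to 50 steps.
Removing easy problems is key when building a code RL dataset: drop every problem that GPT-OSS-120B solves on all 8 of 8 rollouts and keep only the remaining 3.5K hard problems, which substantially improves training efficiency.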
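
The easy-problem filter in the last tip can be sketched as follows. This is a minimal illustration, assuming the number of passing rollouts per problem has already been collected from the reference model; `filter_hard_problems`, the problem IDs, and the counts are hypothetical.

```python
def filter_hard_problems(pass_counts, num_rollouts=8):
    """Keep problems the reference model does NOT solve on every rollout.

    pass_counts: dict mapping problem id -> number of passing rollouts.
    """
    return [pid for pid, passed in pass_counts.items() if passed < num_rollouts]

# Hypothetical pass counts out of 8 rollouts per problem.
pass_counts = {"p1": 8, "p2": 5, "p3": 0, "p4": 8}
hard_problems = filter_hard_problems(pass_counts)
print(hard_problems)
```

Problems solved on every rollout are dropped because they are too easy to be worth training on, which matches the motivation given above.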

Code Example

# Sketch of the core MOPD idea
# Pick a teacher per domain and compute a token-level distillation advantage

import torch

def compute_mopd_loss(
    student_logprobs,      # [B, T] student log p(y_t | s_t), carries gradients
    teacher_logprobs,      # [B, T] domain teacher log p(y_t | s_t)
    inf_logprobs,          # [B, T] log p(y_t | s_t) from the (frozen) inference engine
    eps_low=0.5,
    eps_high=2.0,
):
    with torch.no_grad():
        # Token-level distillation advantage a_t^MOPD (per-token reverse-KL signal).
        # Positive: the teacher gives this token a higher probability, so the student should follow.
        advantage = teacher_logprobs - student_logprobs

        # Truncated importance weight correcting the train/inference mismatch.
        ratio = torch.exp(student_logprobs - inf_logprobs)  # pi_train / pi_inf
        in_range = (ratio >= eps_low) & (ratio <= eps_high)
        weight = ratio * in_range.float()  # tokens outside [eps_low, eps_high] get weight 0

    # Surrogate objective: gradients flow only through student_logprobs.
    # The mean covers all B*T positions, so padding should be masked out beforehand.
    loss = -(weight * advantage * student_logprobs).mean()
    return loss

# Usage example
# domain_teachers = {
#     'math': math_sft_checkpoint,
#     'rlvr': early_ifrl_checkpoint,
#     'rlhf': rlhf_checkpoint
# }
# For each batch, select the teacher matching its domain and apply the function above
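
The per-domain teacher selection described in the usage comment above could look like the following dependency-free sketch. Everything here is hypothetical: `route_to_teachers`, the sample fields (`domain`, `tokens`, `student_logprobs`), and the stub scorers stand in for real teacher checkpoints that would return per-token log-probabilities for the student's own rollout.

```python
def route_to_teachers(batch, teachers):
    """Compute per-token advantages log p_teacher - log p_student,
    scoring each student rollout with the teacher registered for its domain."""
    advantages = []
    for sample in batch:
        score = teachers[sample["domain"]]       # pick the domain teacher
        teacher_lp = score(sample["tokens"])     # per-token teacher log-probs
        student_lp = sample["student_logprobs"]
        advantages.append([t - s for t, s in zip(teacher_lp, student_lp)])
    return advantages

# Stub scorers standing in for teacher checkpoints (assumption: one log-prob per token).
teachers = {
    "math": lambda tokens: [-0.1] * len(tokens),
    "rlhf": lambda tokens: [-1.0] * len(tokens),
}
batch = [
    {"domain": "math", "tokens": [11, 12], "student_logprobs": [-0.5, -0.5]},
    {"domain": "rlhf", "tokens": [21], "student_logprobs": [-0.5]},
]
print(route_to_teachers(batch, teachers))
```

Positive entries mark tokens where the selected teacher is more confident than the student; these per-token values play the role of `advantage` in `compute_mopd_loss`.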

Terminology

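MoE (Mixture of Experts): an architecture that activates only a subset of the model's parameters for each token at inference. Only 3B of the 30B parameters are used per token, so per-token compute is close to that of a 3B model while total capacity is that of a 30B model.

Cascade RL: a training scheme that stacks RL stages sequentially, one domain at a time. The model is specialized stage by stage (for example, Code RL followed by SWE RL), with the order chosen so that each stage does not overwrite what earlier stages learned.

On-Policy Distillation: trains the student to match the teacher's probability distribution on responses that the current student itself generated. Instead of sparse correct/incorrect feedback, it provides a dense learning signal on every token.

GRPO (Group Relative Policy Optimization): an RL algorithm that samples several answers to the same problem and learns from how each one compares with the others in the group. It is simpler to implement than PPO because it needs no value function.

SFT (Supervised Fine-Tuning): training the model to imitate reference answers. It is the first stage, building baseline capabilities before RL.

Catastrophic Forgetting: the model loses skills it previously had while learning a new domain. It is the core problem Cascade RL addresses.

RLHF (Reinforcement Learning from Human Feedback): reinforcement learning that steers the model toward responses humans prefer. Here it is automated by using Qwen3-235B as the judge in place of humans.

TIR (Tool-Integrated Reasoning): a mode in which the model calls a Python executor during reasoning to verify its computations while solving. It improves accuracy on math calculations and coding problems.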

Original Abstract

We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoint and training data.