Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
TL;DR Highlight
NVIDIA's recipe for post-training an open-source 30B MoE reasoning model (3B activated parameters) that achieved gold medal performance at IMO and IOI 2025.
Who Should Read
Researchers and engineers training large-scale reasoning models, particularly those interested in MoE architectures and competition-level math/code performance.
Core Mechanics
- NVIDIA trained a 30B MoE model that achieved gold medal performance on IMO 2025 (mathematics) and IOI 2025 (competitive programming)
- The training recipe combines: large-scale pretraining on math/code heavy data, SFT on curated reasoning traces, and extended RL with verifiable rewards
- MoE (Mixture of Experts) architecture enables a 30B parameter model to achieve performance comparable to much larger dense models while being more efficient to deploy
- Key insight: reasoning ability requires long-horizon RL training — short RL runs produce local optima that don't generalize to competition-level problems
- The model is released as open-source, providing the community with a strong baseline for reasoning research
- Competition performance required problem-specific strategies beyond pure model capability — the training includes learning when and how to use different reasoning strategies
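The routing idea behind the MoE point above can be illustrated with a toy sketch. This is not the paper's architecture; it is a minimal NumPy illustration (with made-up shapes and a single-matrix "expert") of why top-k routing means only a small fraction of parameters is active per token:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_w, expert_ws, top_k=2):
    """Route each token to its top-k experts; only those experts are
    evaluated, so per-token compute scales with top_k, not with the
    total number of experts."""
    gates = softmax(x @ router_w)                # [n_tokens, n_experts]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(gates[t])[-top_k:]      # indices of the k largest gates
        w = gates[t, top] / gates[t, top].sum()  # renormalize the kept gates
        for wi, e in zip(w, top):
            out[t] += wi * np.tanh(x[t] @ expert_ws[e])  # toy one-matrix "expert"
    return out

# Toy shapes: 4 tokens, d_model=8, 6 experts, 2 active per token
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 6))
expert_ws = rng.normal(size=(6, 8, 8))
y = moe_forward(x, router_w, expert_ws)
```

With top_k=2 of 6 experts, each token touches only a third of the expert parameters, which is the efficiency argument for a 30B-total / 3B-active model.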
Evidence
- IMO 2025: model solved 5/6 problems, achieving gold medal threshold — comparable to top human competitors
- IOI 2025: model scored in gold medal range on competitive programming problems
- On standard reasoning benchmarks (MATH-500, AMC, AIME): outperforms comparably-sized dense models by 15-20%
How to Apply
- For training reasoning models: allocate significant budget to the RL phase with long rollouts — the paper shows that short RL training (< 10K steps) is insufficient for competition-level reasoning.
- Use verifiable rewards (unit tests for code, formal verification for math) rather than LLM-judge rewards in the RL phase — this prevents reward hacking and enables more aggressive optimization.
- MoE architecture is worth considering for reasoning models: the routing mechanism can specialize different experts for different problem types, improving sample efficiency.
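The verifiable-reward point can be made concrete with a minimal sketch of a unit-test-based reward for code. This is an assumption-laden illustration, not the paper's reward pipeline: the `solve` entry-point name and the bare `exec` are hypothetical stand-ins (a real system would run candidates in a sandbox), but the binary pass/fail signal is what makes the reward hard to hack:

```python
def code_reward(candidate_src: str, tests: list) -> float:
    """Binary verifiable reward: 1.0 iff the candidate defines `solve`
    and passes every (args, expected) unit test; 0.0 otherwise."""
    ns = {}
    try:
        # Stand-in for sandboxed execution of the model's code
        exec(candidate_src, ns)
        for args, expected in tests:
            if ns["solve"](*args) != expected:
                return 0.0
        return 1.0
    except Exception:
        # Syntax errors, crashes, missing entry point all score zero
        return 0.0

# Usage: a passing and a failing candidate against the same tests
tests = [((3,), 6), ((0,), 0)]
good = "def solve(x):\n    return 2 * x"
bad = "def solve(x):\n    return x + 1"
```

Because the reward is computed by running the code rather than asking a judge model, there is no judge to flatter, which is what allows the aggressive RL optimization described above.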
Code Example
# MOPD core idea implementation sketch
# Select teacher models per domain and compute token-level distillation advantage
import torch

def compute_mopd_loss(
    student_logprobs,  # [B, T] student model's log p(y_t | s_t)
    teacher_logprobs,  # [B, T] domain teacher's log p(y_t | s_t)
    inf_logprobs,      # [B, T] inference engine (frozen) log p(y_t | s_t)
    eps_low=0.5,
    eps_high=2.0,
):
    # Token-level distillation advantage (reverse KL):
    # a positive value means the teacher assigns higher probability
    # to that token, so the student should move toward it
    advantage = (teacher_logprobs - student_logprobs).detach()  # a_t^MOPD

    # Truncated importance weight (corrects train-infer mismatch):
    # keep the ratio only inside [eps_low, eps_high], zero it elsewhere
    with torch.no_grad():
        ratio = torch.exp(student_logprobs - inf_logprobs)  # π_train / π_inf
        weight = ratio * ((ratio >= eps_low) & (ratio <= eps_high)).float()

    # Surrogate objective: update student to follow the teacher advantage
    loss = -(weight * advantage * student_logprobs).mean()
    return loss
# Usage example
# domain_teachers = {
#     'math': math_sft_checkpoint,
#     'rlvr': early_ifrl_checkpoint,
#     'rlhf': rlhf_checkpoint
# }
# Select the teacher matching each batch's domain and apply the function above
Terminology
Related Resources
Original Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.