Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
TL;DR Highlight
NVIDIA's recipe for post-training an open-source 30B MoE reasoning model (3B activated parameters) that achieved gold medal performance at IMO and IOI 2025.
Who Should Read
Researchers and engineers training large-scale reasoning models, particularly those interested in MoE architectures and competition-level math/code performance.
Core Mechanics
- NVIDIA trained a 30B MoE model that achieved gold medal performance on IMO 2025 (mathematics) and IOI 2025 (competitive programming)
- The training recipe combines: large-scale pretraining on math/code heavy data, SFT on curated reasoning traces, and extended RL with verifiable rewards
- MoE (Mixture of Experts) architecture enables a 30B parameter model to achieve performance comparable to much larger dense models while being more efficient to deploy
- Key insight: reasoning ability requires long-horizon RL training — short RL runs produce local optima that don't generalize to competition-level problems
- The model is released as open-source, providing the community with a strong baseline for reasoning research
- Competition performance required problem-specific strategies beyond pure model capability — the training includes learning when and how to use different reasoning strategies
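The routing idea behind the MoE point above can be illustrated with a toy sketch. This is not the paper's architecture; it is a minimal NumPy illustration (with made-up shapes and a single-matrix "expert") of why top-k routing means only a small fraction of parameters is active per token:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_w, expert_ws, top_k=2):
    """Route each token to its top-k experts; only those experts are
    evaluated, so per-token compute scales with top_k, not with the
    total number of experts."""
    gates = softmax(x @ router_w)                # [n_tokens, n_experts]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(gates[t])[-top_k:]      # indices of the k largest gates
        w = gates[t, top] / gates[t, top].sum()  # renormalize the kept gates
        for wi, e in zip(w, top):
            out[t] += wi * np.tanh(x[t] @ expert_ws[e])  # toy one-matrix "expert"
    return out

# Toy shapes: 4 tokens, d_model=8, 6 experts, 2 active per token
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 6))
expert_ws = rng.normal(size=(6, 8, 8))
y = moe_forward(x, router_w, expert_ws)
```

With top_k=2 of 6 experts, each token touches only a third of the expert parameters, which is the efficiency argument for a 30B-total / 3B-active model.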
Evidence
- IMO 2025: model solved 5/6 problems, achieving gold medal threshold — comparable to top human competitors
- IOI 2025: model scored in gold medal range on competitive programming problems
- On standard reasoning benchmarks (MATH-500, AMC, AIME): outperforms comparably-sized dense models by 15-20%
How to Apply
- For training reasoning models: allocate significant budget to the RL phase with long rollouts — the paper shows that short RL training (< 10K steps) is insufficient for competition-level reasoning.
- Use verifiable rewards (unit tests for code, formal verification for math) rather than LLM-judge rewards in the RL phase — this prevents reward hacking and enables more aggressive optimization.
- MoE architecture is worth considering for reasoning models: the routing mechanism can specialize different experts for different problem types, improving sample efficiency.
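The verifiable-reward point can be made concrete with a minimal sketch of a unit-test-based reward for code. This is an assumption-laden illustration, not the paper's reward pipeline: the `solve` entry-point name and the bare `exec` are hypothetical stand-ins (a real system would run candidates in a sandbox), but the binary pass/fail signal is what makes the reward hard to hack:

```python
def code_reward(candidate_src: str, tests: list) -> float:
    """Binary verifiable reward: 1.0 iff the candidate defines `solve`
    and passes every (args, expected) unit test; 0.0 otherwise."""
    ns = {}
    try:
        # Stand-in for sandboxed execution of the model's code
        exec(candidate_src, ns)
        for args, expected in tests:
            if ns["solve"](*args) != expected:
                return 0.0
        return 1.0
    except Exception:
        # Syntax errors, crashes, missing entry point all score zero
        return 0.0

# Usage: a passing and a failing candidate against the same tests
tests = [((3,), 6), ((0,), 0)]
good = "def solve(x):\n    return 2 * x"
bad = "def solve(x):\n    return x + 1"
```

Because the reward is computed by running the code rather than asking a judge model, there is no judge to flatter, which is what allows the aggressive RL optimization described above.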
Code Example
# MOPD core idea implementation sketch
# Select teacher models per domain and compute token-level distillation advantage
import torch

def compute_mopd_loss(
    student_logprobs,  # [B, T] student model's log p(y_t | s_t)
    teacher_logprobs,  # [B, T] domain teacher's log p(y_t | s_t)
    inf_logprobs,      # [B, T] inference engine (frozen) log p(y_t | s_t)
    eps_low=0.5,
    eps_high=2.0,
):
    # Token-level distillation advantage (reverse KL):
    # a positive value means the teacher assigns higher probability
    # to that token, so the student should move toward it
    advantage = (teacher_logprobs - student_logprobs).detach()  # a_t^MOPD

    # Truncated importance weight (corrects train-infer mismatch):
    # keep the ratio only inside [eps_low, eps_high], zero it elsewhere
    with torch.no_grad():
        ratio = torch.exp(student_logprobs - inf_logprobs)  # π_train / π_inf
        weight = ratio * ((ratio >= eps_low) & (ratio <= eps_high)).float()

    # Surrogate objective: update student to follow the teacher advantage
    loss = -(weight * advantage * student_logprobs).mean()
    return loss
# Usage example
# domain_teachers = {
#     'math': math_sft_checkpoint,
#     'rlvr': early_ifrl_checkpoint,
#     'rlhf': rlhf_checkpoint
# }
# Select the teacher matching each batch's domain and apply the function above
Terminology
Related Resources
Original Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.