Context Bootstrapped Reinforcement Learning
TL;DR Highlight
Injecting few-shot examples densely early in RL training, then gradually annealing them away, lets the model internalize their reasoning patterns on its own.
Who Should Read
Researchers and engineers training reasoning models with RL who want to bootstrap better reasoning without permanently relying on in-context examples.
Core Mechanics
- Standard RL training from scratch for reasoning is unstable — models struggle to discover good reasoning patterns via trial and error alone
- Providing few-shot examples throughout RL training helps but creates dependency on examples at inference time
- The proposed curriculum: inject few-shot examples at high density early in training, then gradually reduce their frequency to zero over training
- This lets the model bootstrap from the examples' reasoning patterns and then internalize them, removing the inference-time dependency
- The approach outperforms both vanilla RL (no examples) and constant few-shot RL on math and reasoning benchmarks
- The gradual removal schedule matters — abrupt removal causes forgetting, while gradual removal allows proper internalization
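To make the schedule concrete, here is a minimal sketch of what gradual versus abrupt removal can look like. The schedule names, shapes, and constants below are illustrative assumptions for this summary, not the paper's exact schedules:

import math

def injection_prob(step, total_steps, schedule="linear", p_start=1.0, p_end=0.0):
    """Probability of prepending few-shot examples at a given training step."""
    frac = min(step / total_steps, 1.0)  # fraction of training completed
    if schedule == "linear":
        # steady decay from p_start down to p_end
        p = p_start + frac * (p_end - p_start)
    elif schedule == "cosine":
        # smooth decay: slow at the start and end, fastest in the middle
        p = p_end + (p_start - p_end) * 0.5 * (1 + math.cos(math.pi * frac))
    elif schedule == "abrupt":
        # ablation-style schedule: drop examples all at once partway through
        p = p_start if frac < 0.5 else p_end
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return p

The "abrupt" schedule corresponds to the ablation described above: the model never gets a transition phase in which it must succeed with progressively less scaffolding, which is the phase the gradual schedules provide.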
Evidence
- On MATH benchmark: gradual few-shot curriculum RL outperformed vanilla RL by 8+ points and constant few-shot RL by 3+ points
- Models trained with gradual removal generalized better to out-of-distribution problems than models trained with constant examples
- Abrupt removal (injecting then suddenly dropping examples) showed significant performance degradation compared to gradual removal
How to Apply
- Start RL training with few-shot examples in every prompt. Over the first 30-40% of training steps, linearly reduce the probability of including examples from 1.0 to 0.0.
- Monitor reward signal stability during the removal phase — if rewards drop sharply, slow the removal rate. Gradual decay should maintain reward improvement trajectory.
- Select few-shot examples that demonstrate the specific reasoning patterns you want the model to internalize — the quality of examples matters more here than in standard few-shot prompting.
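The reward-monitoring advice above can be sketched as a small helper that slows the decay when the smoothed reward drops sharply. This class, its name, and its threshold/smoothing constants are hypothetical illustrations of the idea, not part of the paper:

class AdaptiveRemovalMonitor:
    """Slow the injection-probability decay when rewards drop sharply.

    Illustrative sketch: the drop threshold, EMA smoothing, and the
    halve-the-decay response are assumed values, not from the paper.
    """
    def __init__(self, base_decay_per_step=0.002, drop_threshold=0.1, ema_beta=0.9):
        self.base_decay = base_decay_per_step   # nominal linear decay per step
        self.drop_threshold = drop_threshold    # relative reward drop that triggers slowdown
        self.ema_beta = ema_beta                # smoothing factor for the reward signal
        self.reward_ema = None
        self.p = 1.0                            # current injection probability

    def update(self, batch_reward):
        """Fold in the latest batch reward and advance the injection probability."""
        if self.reward_ema is None:
            self.reward_ema = batch_reward
        prev = self.reward_ema
        self.reward_ema = self.ema_beta * prev + (1 - self.ema_beta) * batch_reward
        # If the smoothed reward fell sharply, halve the decay rate this step
        sharp_drop = prev > 0 and (prev - self.reward_ema) / abs(prev) > self.drop_threshold
        decay = self.base_decay * (0.5 if sharp_drop else 1.0)
        self.p = max(0.0, self.p - decay)
        return self.p

Typical usage: call monitor.update(mean_batch_reward) once per training step and feed the returned probability into prompt composition.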
Code Example
# CBRL core logic implementation example
import random

class CBRLScheduler:
    def __init__(self, p_start=0.5, p_end=0.0, total_steps=500):
        self.p_start = p_start
        self.p_end = p_end
        self.total_steps = total_steps

    def get_injection_prob(self, current_step):
        """Calculate injection probability with linear annealing"""
        t = min(current_step, self.total_steps)  # clamp so p never undershoots p_end
        T = self.total_steps
        return self.p_start + (t - 1) / (T - 1) * (self.p_end - self.p_start)

def compose_prompt(query, few_shot_bank, injection_prob, k=2):
    """Stochastically prepend few-shot examples to the prompt"""
    if random.random() < injection_prob:
        examples = random.sample(few_shot_bank, min(k, len(few_shot_bank)))
        # Compose few-shot examples as user-assistant exchange turns
        messages = []
        for ex in examples:
            messages.append({"role": "user", "content": ex["question"]})
            messages.append({"role": "assistant", "content": f"<think>{ex['reasoning']}</think><answer>{ex['answer']}</answer>"})
        messages.append({"role": "user", "content": query})
    else:
        # Just the question, without few-shot examples
        messages = [{"role": "user", "content": query}]
    return messages

# Usage in training loop
scheduler = CBRLScheduler(p_start=0.5, p_end=0.0, total_steps=500)
for step in range(1, 501):
    p = scheduler.get_injection_prob(step)
    batch_prompts = [
        compose_prompt(query, few_shot_bank, p, k=2)
        for query in training_batch
    ]
    # Proceed with normal training using GRPO/RLOO
    # Rewards are computed solely from model responses, without the few-shot examples
Original Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.