Context Bootstrapped Reinforcement Learning
TL;DR Highlight
Injecting few-shot examples densely early in RL training, then gradually annealing them away, lets the model internalize their reasoning patterns on its own.
Who Should Read
Researchers and engineers training reasoning models with RL who want to bootstrap better reasoning without permanently relying on in-context examples.
Core Mechanics
- Standard RL training from scratch for reasoning is unstable — models struggle to discover good reasoning patterns via trial and error alone
- Providing few-shot examples throughout RL training helps but creates dependency on examples at inference time
- The proposed curriculum: inject few-shot examples at high density early in training, then gradually reduce their frequency to zero over training
- This lets the model bootstrap from the examples' reasoning patterns and then internalize them, removing the inference-time dependency
- The approach outperforms both vanilla RL (no examples) and constant few-shot RL on math and reasoning benchmarks
- The gradual removal schedule matters — abrupt removal causes forgetting, while gradual removal allows proper internalization
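To make the schedule concrete, here is a minimal sketch of what gradual versus abrupt removal can look like. The schedule names, shapes, and constants below are illustrative assumptions for this summary, not the paper's exact schedules:

import math

def injection_prob(step, total_steps, schedule="linear", p_start=1.0, p_end=0.0):
    """Probability of prepending few-shot examples at a given training step."""
    frac = min(step / total_steps, 1.0)  # fraction of training completed
    if schedule == "linear":
        # steady decay from p_start down to p_end
        p = p_start + frac * (p_end - p_start)
    elif schedule == "cosine":
        # smooth decay: slow at the start and end, fastest in the middle
        p = p_end + (p_start - p_end) * 0.5 * (1 + math.cos(math.pi * frac))
    elif schedule == "abrupt":
        # ablation-style schedule: drop examples all at once partway through
        p = p_start if frac < 0.5 else p_end
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return p

The "abrupt" schedule corresponds to the ablation described above: the model never gets a transition phase in which it must succeed with progressively less scaffolding, which is the phase the gradual schedules provide.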
Evidence
- On MATH benchmark: gradual few-shot curriculum RL outperformed vanilla RL by 8+ points and constant few-shot RL by 3+ points
- Models trained with gradual removal generalized better to out-of-distribution problems than models trained with constant examples
- Abrupt removal (injecting then suddenly dropping examples) showed significant performance degradation compared to gradual removal
How to Apply
- Start RL training with few-shot examples in every prompt. Over the first 30-40% of training steps, linearly reduce the probability of including examples from 1.0 to 0.0.
- Monitor reward signal stability during the removal phase — if rewards drop sharply, slow the removal rate. Gradual decay should maintain reward improvement trajectory.
- Select few-shot examples that demonstrate the specific reasoning patterns you want the model to internalize — the quality of examples matters more here than in standard few-shot prompting.
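The reward-monitoring advice above can be sketched as a small helper that slows the decay when the smoothed reward drops sharply. This class, its name, and its threshold/smoothing constants are hypothetical illustrations of the idea, not part of the paper:

class AdaptiveRemovalMonitor:
    """Slow the injection-probability decay when rewards drop sharply.

    Illustrative sketch: the drop threshold, EMA smoothing, and the
    halve-the-decay response are assumed values, not from the paper.
    """
    def __init__(self, base_decay_per_step=0.002, drop_threshold=0.1, ema_beta=0.9):
        self.base_decay = base_decay_per_step   # nominal linear decay per step
        self.drop_threshold = drop_threshold    # relative reward drop that triggers slowdown
        self.ema_beta = ema_beta                # smoothing factor for the reward signal
        self.reward_ema = None
        self.p = 1.0                            # current injection probability

    def update(self, batch_reward):
        """Fold in the latest batch reward and advance the injection probability."""
        if self.reward_ema is None:
            self.reward_ema = batch_reward
        prev = self.reward_ema
        self.reward_ema = self.ema_beta * prev + (1 - self.ema_beta) * batch_reward
        # If the smoothed reward fell sharply, halve the decay rate this step
        sharp_drop = prev > 0 and (prev - self.reward_ema) / abs(prev) > self.drop_threshold
        decay = self.base_decay * (0.5 if sharp_drop else 1.0)
        self.p = max(0.0, self.p - decay)
        return self.p

Typical usage: call monitor.update(mean_batch_reward) once per training step and feed the returned probability into prompt composition.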
Code Example
# CBRL core logic implementation example
import random

class CBRLScheduler:
    def __init__(self, p_start=0.5, p_end=0.0, total_steps=500):
        self.p_start = p_start
        self.p_end = p_end
        self.total_steps = total_steps

    def get_injection_prob(self, current_step):
        """Calculate injection probability with linear annealing"""
        t = min(current_step, self.total_steps)  # clamp so p never undershoots p_end
        T = self.total_steps
        return self.p_start + (t - 1) / (T - 1) * (self.p_end - self.p_start)

def compose_prompt(query, few_shot_bank, injection_prob, k=2):
    """Stochastically prepend few-shot examples to the prompt"""
    if random.random() < injection_prob:
        examples = random.sample(few_shot_bank, min(k, len(few_shot_bank)))
        # Compose few-shot examples as user-assistant exchange turns
        messages = []
        for ex in examples:
            messages.append({"role": "user", "content": ex["question"]})
            messages.append({"role": "assistant", "content": f"<think>{ex['reasoning']}</think><answer>{ex['answer']}</answer>"})
        messages.append({"role": "user", "content": query})
    else:
        # Just the question, without few-shot examples
        messages = [{"role": "user", "content": query}]
    return messages

# Usage in training loop
scheduler = CBRLScheduler(p_start=0.5, p_end=0.0, total_steps=500)
for step in range(1, 501):
    p = scheduler.get_injection_prob(step)
    batch_prompts = [
        compose_prompt(query, few_shot_bank, p, k=2)
        for query in training_batch
    ]
    # Proceed with normal training using GRPO/RLOO
    # Rewards are computed solely from model responses, without the few-shot examples
Original Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.