Context Bootstrapped Reinforcement Learning: Improving RL Exploration Efficiency with Few-Shot Demonstrations

Context Bootstrapped Reinforcement Learning

Mar 19, 2026 • Saaket Agashe, Jayanth Srinivasa, Gaowen Liu +2

TL;DR Highlight

A method that stochastically injects few-shot examples early in RL training and gradually phases them out, so that the model internalizes the reasoning patterns on its own.

Who Should Read

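ML engineers who fine-tune LLMs with RL for a specific domain (code, math, specialized languages, and so on) and struggle with sparse learning signal early in training. Developers applying policy gradient methods such as GRPO or RLOO in practice.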

Core Mechanics

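Addresses the core problem of RLVR (reinforcement learning from verifiable rewards), exploration inefficiency, in which a model that cannot produce a correct rollout early on receives zero learning signal, by injecting few-shot demonstrations.
Uses a curriculum schedule: early in training, few-shot examples are prepended to the prompt with probability p=0.5, and p decreases linearly to 0.0 by the end of training.
Few-shot examples that include reasoning traces are injected stochastically rather than at every step, so the model also experiences having to solve problems without any examples.
Algorithm-agnostic design that works with both GRPO and RLOO: only the input data changes, and the training algorithm itself is left untouched.
Large gains: Word Sorting (RLOO) 20% → 67%, Q programming pass@1 5.0% → 26.3%. Performance holds even after the few-shot examples are removed entirely.
An initial injection probability of p=0.5 works best: at p=1.0 the model becomes dependent on the examples, and at p=0.0 there is too little learning signal, so both extremes perform worse.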
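
To make the "zero learning signal" point concrete, the sketch below computes per-sample advantages for one prompt's group of rollouts in the style of GRPO (reward minus group mean, divided by group standard deviation) and RLOO (reward minus the mean of the other samples). The function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper; real implementations differ in these choices.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def rloo_advantages(rewards):
    """RLOO-style advantage: r_i minus the mean of the other rewards."""
    total = sum(rewards)
    n = len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

all_failed = [0.0, 0.0, 0.0, 0.0]   # no rollout earned a reward
one_success = [0.0, 0.0, 1.0, 0.0]  # a single rollout succeeded

print(grpo_advantages(all_failed))   # every advantage is zero
print(rloo_advantages(all_failed))   # every advantage is zero
print(grpo_advantages(one_success))  # positive for the success, negative otherwise
print(rloo_advantages(one_success))
```

When every rollout in a group fails, all advantages are zero and the policy gradient for that prompt vanishes; a single success is enough to produce a non-zero update, which is what the injected demonstrations are meant to make more likely.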

Evidence

On all five Reasoning Gym tasks, CBRL improves over baseline GRPO by +1.3% to +22.3% (for both Qwen2.5-3B and Llama-3.2-3B).
Q programming (a domain-specific language): pass@1 rises from 5.0% with GRPO alone to 26.3% with CBRL-GRPO, and the average test pass rate rises from 27.3% to 43.0%.
The effect also holds with RLOO: Word Sorting 20% → 67%, Puzzle-24 23% → 66%, Spell Backward 63% → 89%.
Ablation of the initial injection probability on ARC-1D: p=0.0 (no injection) 26%, p=0.5 (best) 31%, p=1.0 (always inject) 23%.

How to Apply

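When training a domain-specific model with GRPO, start by prepending 2 solved examples to each training prompt with probability p=0.5, then decrease p linearly to 0.0 by the end of training. The training algorithm's code does not need to change; only the data preprocessing (prompt construction) step is modified.
Building the few-shot bank: collect 20 to 50 examples per task, generate a step-by-step reasoning trace for each with a GPT-4-class model, and store them as (question, reasoning, answer) triples. For domain-specific tasks such as Q code, use tag-based filtering to inject only relevant examples.
If the reward curve stays flat early in training (a sign of exploration inefficiency), consider applying CBRL. It can be attached to either GRPO or RLOO, and because the model runs without few-shot examples at inference time, inference cost does not increase.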
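
One possible shape for a tag-filtered few-shot bank is sketched below. The `Demo` dataclass, the `select_demos` helper, the tag names, and the fallback to the whole bank when no tag matches are all hypothetical choices for illustration; the source only states that examples are stored as (question, reasoning, answer) and filtered by tags. The bank contents are placeholders.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Demo:
    question: str
    reasoning: str
    answer: str
    tags: set = field(default_factory=set)  # hypothetical topic labels

def select_demos(bank, query_tags, k=2, rng=random):
    """Pick up to k demos, preferring those that share a tag with the query."""
    query_tags = set(query_tags)
    related = [d for d in bank if d.tags & query_tags]
    # Assumption: if nothing matches, fall back to sampling from the whole bank
    pool = related if related else list(bank)
    return rng.sample(pool, min(k, len(pool)))

# Placeholder bank; real entries would hold solved tasks with reasoning traces
bank = [
    Demo("question A", "reasoning A", "answer A", {"list"}),
    Demo("question B", "reasoning B", "answer B", {"list", "sort"}),
    Demo("question C", "reasoning C", "answer C", {"aggregation"}),
    Demo("question D", "reasoning D", "answer D", {"string"}),
]

demos = select_demos(bank, {"list"}, k=2, rng=random.Random(0))
print([d.question for d in demos])
```

Passing a seeded `random.Random` instance keeps the selection reproducible across runs, which helps when comparing training configurations.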

Code Example

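# Example implementation of the core CBRL logic
import random

class CBRLScheduler:
    def __init__(self, p_start=0.5, p_end=0.0, total_steps=500):
        self.p_start = p_start
        self.p_end = p_end
        self.total_steps = total_steps

    def get_injection_prob(self, current_step):
        """Compute the injection probability with linear annealing (steps are 1-indexed)."""
        if self.total_steps <= 1:
            # Avoid division by zero for a single-step run
            return self.p_end
        # Clamp progress to [0, 1] so out-of-range steps cannot leave the [p_end, p_start] range
        frac = min(max((current_step - 1) / (self.total_steps - 1), 0.0), 1.0)
        return self.p_start + frac * (self.p_end - self.p_start)

def compose_prompt(query, few_shot_bank, injection_prob, k=2):
    """Stochastically prepend few-shot examples to the prompt."""
    if random.random() < injection_prob:
        examples = random.sample(few_shot_bank, min(k, len(few_shot_bank)))
        # Format the few-shot examples as user-assistant exchanges
        messages = []
        for ex in examples:
            messages.append({"role": "user", "content": ex["question"]})
            messages.append({"role": "assistant", "content": f"<think>{ex['reasoning']}</think><answer>{ex['answer']}</answer>"})
        messages.append({"role": "user", "content": query})
    else:
        # No few-shot examples: just the query
        messages = [{"role": "user", "content": query}]
    return messages

# Placeholder data (hypothetical) so the loop below runs; replace with your own
few_shot_bank = [{"question": "example question", "reasoning": "example reasoning", "answer": "example answer"}]
training_batch = ["example training query"]

# Usage in the training loop
scheduler = CBRLScheduler(p_start=0.5, p_end=0.0, total_steps=500)

for step in range(1, 501):
    p = scheduler.get_injection_prob(step)
    batch_prompts = [
        compose_prompt(query, few_shot_bank, p, k=2)
        for query in training_batch
    ]
    # Continue with normal GRPO/RLOO training from here
    # Rewards are computed only on the model's generated response, not on the injected few-shot examples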
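
As a quick sanity check on the linear schedule used above, the sketch below evaluates it over a full run and reports the average injection probability. The standalone `injection_prob` function is a hypothetical helper written for this check; the 500-step run length follows the example above.

```python
def injection_prob(step, total_steps, p_start=0.5, p_end=0.0):
    """Linear annealing from p_start at step 1 to p_end at the final step."""
    if total_steps <= 1:
        return p_end
    frac = (step - 1) / (total_steps - 1)
    return p_start + frac * (p_end - p_start)

total_steps = 500
probs = [injection_prob(s, total_steps) for s in range(1, total_steps + 1)]
average = sum(probs) / total_steps

print(f"first={probs[0]:.3f} last={probs[-1]:.3f} average={average:.3f}")
# prints: first=0.500 last=0.000 average=0.250
```

Under these settings only about a quarter of all training prompts carry demonstrations, so the extra prompt tokens are a bounded overhead that shrinks to nothing by the end of training.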

Terminology

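RLVR: Reinforcement learning of an LLM using rewards that can be checked unambiguously, such as correct versus incorrect. No per-sample human feedback is required.

GRPO: Group Relative Policy Optimization. An RL algorithm that generates multiple responses to the same problem and upweights the relatively better ones. Popularized by DeepSeek-R1.

RLOO: REINFORCE Leave-One-Out. A policy gradient algorithm similar to GRPO that baselines each sample against the mean reward of the other samples for the same prompt; its gradient variance tends to be somewhat higher.

rollout: One complete generation of a response by the model for a given prompt. In RL, rewards are collected from this process.

In-Context Learning (ICL): A technique that gets the model to follow a pattern purely by placing examples in the prompt, without changing model parameters. Few-shot prompting is the typical case.

Exploration Inefficiency: The situation early in RL training where the model never produces a correct answer, so the learning signal (gradient) is zero. With no successful rollouts, there is nothing to learn from.

policy gradient: An umbrella term for RL methods that update the model to raise the probability of good actions and lower the probability of bad ones.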

Original Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.