CORE: Rapidly Improving Reasoning Ability with Contrastive Reflection

TL;DR Highlight

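A non-parametric algorithm that compares successful and failed reasoning traces to extract short natural-language insights, and that improves model reasoning performance faster than GRPO with as few as five training samples.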

Who Should Read

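ML engineers and researchers who need to improve LLM reasoning performance, especially when training data is scarce or when the goal is to strengthen a model without paying for fine-tuning.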

Core Mechanics

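CORE is a non-parametric method that never touches the model weights: the model stays frozen, and performance improves by accumulating insights in an external memory.
The core idea is Contrastive Reflection: a failed reasoning trace is placed side by side with a successful trace from a similar problem, and an LLM is asked to analyze the differences and produce a short natural-language insight.
A generated insight is stored in memory only if it passes an admission test: the model must actually solve the problem with the insight attached, otherwise the insight is discarded.
Insight retrieval considers both semantic similarity and a utility score (how much the insight has helped in the past). Removing only the utility term drops performance from 0.907 to 0.780.
The experiments use GPT-OSS-120B (OpenAI's open-weight 120B model) as the reasoning model and jina-embeddings-v2-base-en (137M parameters, BERT-based) for embeddings.
On four tasks (Tower of Hanoi, MathGAP, ZebraLogic, and Matchstick Arithmetic), CORE outperforms all of GRPO, GEPA, Episodic RAG, and MemRL.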
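
The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Insight` class, the `retrieve` function, the linear mix controlled by `alpha`, the assumption that utility lies in [0, 1], and the toy 3-dimensional embeddings are all hypothetical. The source only states that both semantic similarity and utility are taken into account.

```python
import math
from dataclasses import dataclass


@dataclass
class Insight:
    text: str
    embedding: list[float]  # embedding associated with the insight (assumed)
    utility: float          # how much it helped in the past, assumed in [0, 1]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: list[float], memory: list[Insight],
             k: int = 2, alpha: float = 0.5) -> list[Insight]:
    """Rank insights by a weighted mix of similarity and utility (assumed form)."""
    def score(ins: Insight) -> float:
        return alpha * cosine(query, ins.embedding) + (1 - alpha) * ins.utility
    return sorted(memory, key=score, reverse=True)[:k]


# Toy memory with made-up embeddings and utilities.
memory = [
    Insight("insight A", [1.0, 0.0, 0.0], 0.1),  # most similar, rarely useful
    Insight("insight B", [0.9, 0.1, 0.0], 0.9),  # nearly as similar, very useful
    Insight("insight C", [0.0, 1.0, 0.0], 0.5),  # unrelated
]
query = [1.0, 0.0, 0.0]
top = retrieve(query, memory, k=2)
similarity_only = retrieve(query, memory, k=1, alpha=1.0)
```

With `alpha=1.0` the ranking falls back to pure semantic similarity, which is the kind of variant the utility ablation above compares against.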

Evidence

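After only 350 rollouts (training attempts), CORE's average held-out accuracy rises from 0.445 to 0.712, a 59.9% relative improvement. At that point it already exceeds the best performance of every baseline given 4000 rollouts.
Even with as few as five training samples, CORE ranks first in 9 of 12 conditions (4 tasks × 3 dataset sizes), with an average 54.8% performance gain over the no-learning baseline.
Context efficiency: at evaluation time CORE adds 0.92k tokens on average, while Episodic RAG adds 33.6k and MemRL adds 32.7k, which is 36.6× and 35.6× more tokens respectively.
Contrastive Reflection ablation: using only failed traces gives 0.617, using only successful traces gives 0.830, and CORE, which contrasts both, reaches 0.907 (Matchstick Arithmetic, 10 samples).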

How to Apply

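Search or recommendation systems with a user feedback loop: store success-failure pairs whenever a correct or incorrect result appears, and periodically ask an LLM to "explain the difference between these two responses as a rule" to accumulate insights.
Code generation or SQL generation agents: use test pass/fail as the verifier, compare failed code against successful code from a similar problem, and incrementally add the extracted rules (e.g., "check array indices", "handle NULL") to the prompt.
Borrowing only the admission test pattern for prompt optimization: before adding a new rule or instruction to the prompt, validate its effect on a small number of samples, and attach gating logic to the pipeline that adopts it only when the success rate exceeds the baseline.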
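
The gating idea in the last item could look something like the sketch below. Everything here is an assumption for illustration: `solve` is a hypothetical callable standing in for "call the LLM with this prompt, then run the verifier", and `toy_solve`, the sample problems, and the `margin` parameter are made up so the example runs without a model.

```python
from typing import Callable, Sequence

# Hypothetical signature: (prompt, problem) -> did the verifier accept the answer?
Solver = Callable[[str, str], bool]


def success_rate(solve: Solver, problems: Sequence[str], prompt: str) -> float:
    """Fraction of problems solved with the given prompt."""
    if not problems:
        raise ValueError("need at least one validation problem")
    return sum(solve(prompt, p) for p in problems) / len(problems)


def admit_rule(solve: Solver, problems: Sequence[str], base_prompt: str,
               rule: str, margin: float = 0.0) -> bool:
    """Adopt the rule only if it beats the baseline success rate by more than margin."""
    baseline = success_rate(solve, problems, base_prompt)
    candidate = success_rate(solve, problems, base_prompt + "\n" + rule)
    return candidate > baseline + margin


def toy_solve(prompt: str, problem: str) -> bool:
    """Stand-in for an LLM call plus verifier: NULL problems need a NULL rule."""
    if "null" in problem:
        return "NULL" in prompt
    return True


problems = ["sum column", "count rows with null", "join on null key"]
keep = admit_rule(toy_solve, problems, "Write SQL.", "Handle NULL values explicitly.")
drop = admit_rule(toy_solve, problems, "Write SQL.", "Be concise.")
```

In a real pipeline the validation set would be small, so a positive `margin` or repeated sampling may be worth considering to avoid admitting rules on noise.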

Code Example
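
A minimal sketch of one Contrastive Reflection step, under stated assumptions rather than as the authors' code: `Trace`, `build_reflection_prompt`, and `reflect_and_store` are hypothetical names, the prompt wording is invented, and `llm` and `attempt` are placeholder callables for the frozen model and for "retry the failed problem with the insight attached, then verify". The stubs at the bottom exist only so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    problem: str
    reasoning: str
    solved: bool


def build_reflection_prompt(failed: Trace, succeeded: Trace) -> str:
    """Place a failed and a successful trace side by side (wording is assumed)."""
    return (
        "Compare the two attempts below and state, in one or two sentences, "
        "a reusable rule that explains the difference.\n\n"
        f"[FAILED] Problem: {failed.problem}\n{failed.reasoning}\n\n"
        f"[SUCCEEDED] Problem: {succeeded.problem}\n{succeeded.reasoning}\n"
    )


def reflect_and_store(failed: Trace, succeeded: Trace,
                      llm: Callable[[str], str],
                      attempt: Callable[[str, str], bool],
                      memory: list[str]) -> bool:
    """Generate an insight and keep it only if it passes the admission test."""
    if failed.solved or not succeeded.solved:
        raise ValueError("expected one failed and one successful trace")
    insight = llm(build_reflection_prompt(failed, succeeded)).strip()
    # Admission test: the failed problem must be solved with the insight attached.
    if insight and attempt(failed.problem, insight):
        memory.append(insight)
        return True
    return False


# Stubs standing in for the frozen model and the verifier (made-up behavior).
fake_llm = lambda prompt: "Plan the full move sequence before moving any disk. "
passing_attempt = lambda problem, insight: True
failing_attempt = lambda problem, insight: False

memory: list[str] = []
failed = Trace("Tower of Hanoi, 3 disks", "Moved disks greedily...", False)
ok = Trace("Tower of Hanoi, 2 disks", "Planned the sequence first...", True)
admitted = reflect_and_store(failed, ok, fake_llm, passing_attempt, memory)
rejected = reflect_and_store(failed, ok, fake_llm, failing_attempt, memory)
```

Only the short insight string is stored, not the traces themselves, which is what keeps the added context small compared with storing full past solutions.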

Terminology

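non-parametric: An approach that never modifies the model weights (parameters). The model itself is left unchanged, and performance is improved by changing only the external memory or the prompt.

GRPO: Group Relative Policy Optimization, the reinforcement learning technique used in DeepSeek-R1. The model makes multiple attempts, and the weights are updated to reinforce the good attempts and suppress the bad ones.

RLVR: Reinforcement Learning from Verifiable Rewards. A way to train a model on problems whose answers can be clearly verified (math, coding, etc.) by rewarding correct answers and penalizing incorrect ones.

rollout: One attempt by the model at solving one problem. 100 rollouts means 100 attempts.

Episodic RAG: An approach that stores previously solved problems together with their full solutions and, when a new problem arrives, retrieves similar past solutions and inserts them into the context.

MemRL: A memory system that attaches value scores to episodic memory (records of experience) and retrieves mainly the experiences that helped in the past.

admission test: A gate that verifies whether a newly generated insight actually works. The problem is attempted with the insight attached, and the insight is stored in memory only if the success rate is higher than the baseline.

utility score: A numeric score for how much a particular insight has helped in the past. The higher it is, the more often the insight is retrieved for similar problems.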

Related Resources

CORE GitHub Repository

Original Abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.