Online Experiential Learning (OEL) for Language Models

Mar 17, 2026•Tianzhu Ye, Li Dong, Qingxiu Dong +3

TL;DR Highlight

An LLM framework that keeps learning from real-world usage after deployment, with no reward function or human labeling required.

Who Should Read

ML engineers who deploy LLM agents in real environments and want to keep improving them over time, especially developers designing a post-deployment model update pipeline.

Core Mechanics

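A two-stage loop: "experiential knowledge" is automatically extracted from the text trajectories of the deployed model's interactions with the real environment, then internalized into the model parameters.
Online learning works from text feedback alone, with no reward function, reward model, or human annotation: fully reward-free.
On-Policy Context Distillation (training on responses the model itself generated) prevents degradation of OOD (out-of-distribution) performance, i.e. catastrophic forgetting.
Extracted experiential knowledge is far more effective than raw trajectories: on Sokoban, the pass rate after consolidation is 21.4% with extracted knowledge vs. 7.8% with raw trajectories.
Experiential knowledge from the model itself (Qwen3-1.7B) is more effective than knowledge from a larger model (Qwen3-4B): on-policy consistency is the key.
Response length shrinks with each iteration, which also improves token efficiency: on Frozen Lake, response length falls to roughly 70% of its original level after 3 rounds.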
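
The two-stage loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `oel_round` and its three callables (`collect`, `extract`, `consolidate`) are hypothetical names standing in for user-side trajectory collection, the knowledge-extraction prompt, and the server-side distillation step.

```python
from typing import Callable, List


def oel_round(
    collect: Callable[[], List[str]],
    extract: Callable[[str, List[str]], List[str]],
    consolidate: Callable[[List[str]], None],
    experience: List[str],
) -> List[str]:
    """Run one hypothetical OEL round and return the updated experience list."""
    experience = list(experience)  # work on a copy, leave the caller's list alone

    # Stage 1: accumulate experiential knowledge from user-side trajectories.
    for trajectory in collect():
        for item in extract(trajectory, experience):
            if item not in experience:  # skip exact repeats
                experience.append(item)

    # Stage 2: consolidate the knowledge into model parameters (server side).
    consolidate(experience)
    return experience


# Toy usage with stand-in callables instead of a real model or environment.
consolidated = []
updated = oel_round(
    collect=lambda: ["fell into a hole", "reached the goal"],
    extract=lambda traj, exp: [f"lesson from: {traj}"],
    consolidate=consolidated.append,
    experience=[],
)
print(updated)
```

Iterating the loop means calling `oel_round` again with the returned list, so that the improved model's new trajectories feed the next round.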

Evidence

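On Sokoban, experiential knowledge reaches a pass rate of 18.2% when used in context and 21.4% after consolidation, vs. 10.9% and 7.8% for raw trajectories.
On Frozen Lake with Qwen3-1.7B, the model's own experiential knowledge gives a pass rate of 23.8% (in-context) and 31.1% (consolidation), vs. 18.0% and 22.7% with knowledge from the larger Qwen3-4B.
On-policy context distillation keeps IF-Eval OOD accuracy at the initial model's level (roughly 66-67%), whereas off-policy distillation shows a clear decline as training progresses.
Qwen3-1.7B, 4B, and 8B all show consistent pass-rate gains with each OEL round, so the method is effective regardless of model size.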

How to Apply

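After deploying an agent service, collect multi-turn conversation logs with users and periodically run a prompt that has the same model extract experiential knowledge in the '- EXPERIENCE ITEM:' format.
Attach the extracted experiential knowledge to the context to form the teacher, then fine-tune by minimizing the reverse KL divergence against it on responses the model itself generates. Training runs entirely server-side, with no access to the environment.
This applies directly to any environment with text feedback, such as RAG or game agents, and is especially useful for open-domain agents where reward design is hard.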
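
To make the fine-tuning step above concrete, here is a small sketch of the quantity being minimized, assuming the student sees only the prompt while the teacher sees the prompt plus the experience. The helper names (`build_inputs`, `reverse_kl`, `distillation_loss`) and the toy probabilities are illustrative assumptions. A real setup would take both distributions from model logits at each position of a student-sampled response.

```python
import math
from typing import List, Sequence, Tuple


def build_inputs(experience: str, prompt: str) -> Tuple[str, str]:
    """Return (teacher_input, student_input); only the teacher sees experience."""
    teacher_input = f"Given experience:\n{experience}\n\nCurrent situation:\n{prompt}"
    return teacher_input, prompt


def reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) for one next-token distribution.

    Assumes teacher[i] > 0 wherever student[i] > 0.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)


def distillation_loss(
    student_steps: List[Sequence[float]], teacher_steps: List[Sequence[float]]
) -> float:
    """Mean per-token reverse KL over one response sampled from the student."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy two-token vocabulary, two response positions (made-up numbers).
student_steps = [[0.5, 0.5], [0.9, 0.1]]
teacher_steps = [[0.9, 0.1], [0.9, 0.1]]
loss = distillation_loss(student_steps, teacher_steps)
print(f"mean reverse KL: {loss:.4f}")
```

The loss is zero at positions where the student already matches the teacher, so the gradient signal comes only from positions where the experience in the teacher's context changes the prediction.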

Code Example

# Example prompt for extracting experiential knowledge (structured format)
prompt_template = """
You are an AI language model that continuously refines its internal experience.

Here is the interaction history (the environment (input) and your response and action (output)):
{latest_experience}

Here is the previous experience:
# Experience
{previous_experience}

Your task:
Based on the multi-round interaction history, generate experience for future learning.
Conduct a deep, comparative analysis to infer the rules and the fundamental principles behind success and failure.
Organize insights into 1-2 concise, high-level, widely applicable experience items.

Rules:
- Format MUST be:
  - EXPERIENCE ITEM: ...
- Do NOT repeat previous experience.
- Make experience general, not specific to the current case.

Additional Experience:
# Experience
- EXPERIENCE ITEM:
"""

# Solve a new problem with the experiential knowledge attached to the context
solving_prompt = """
You are an agent acting as a reasoning engine.
Your decisions are based on the experience you have learned.
This experience may be incomplete or incorrect.

Given experience:
{experience}

Current situation:
{prompt}

What action do you take?
"""
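
The templates above leave open how the model's output is turned back into the `{previous_experience}` and `{experience}` fields. A minimal sketch under stated assumptions: the helper names below are hypothetical, and because the extraction prompt ends with a prefilled `- EXPERIENCE ITEM:`, the caller is assumed to prepend that prefix to the raw completion before parsing.

```python
import re
from typing import List

# Matches one "- EXPERIENCE ITEM: ..." line; [ \t] keeps the match on one line.
ITEM_PATTERN = re.compile(
    r"^[ \t]*-[ \t]*EXPERIENCE ITEM:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE
)


def parse_experience_items(model_output: str) -> List[str]:
    """Extract the text of every well-formed experience item line."""
    return ITEM_PATTERN.findall(model_output)


def merge_experience(previous: List[str], new_items: List[str]) -> List[str]:
    """Append new items, dropping case-insensitive exact repeats."""
    merged = list(previous)
    seen = {item.lower() for item in previous}
    for item in new_items:
        if item.lower() not in seen:
            merged.append(item)
            seen.add(item.lower())
    return merged


def render_experience(items: List[str]) -> str:
    """Format items for the {previous_experience} or {experience} placeholder."""
    return "\n".join(f"- EXPERIENCE ITEM: {item}" for item in items)


# Toy usage with a made-up completion.
completion = " Check for holes before each move.\n- EXPERIENCE ITEM: Prefer short plans."
items = parse_experience_items("- EXPERIENCE ITEM:" + completion)
experience = merge_experience(["Prefer short plans."], items)
print(render_experience(experience))
```

Exact-match deduplication is only a cheap safeguard. The prompt's "Do NOT repeat previous experience" rule is what is meant to prevent semantic repeats.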

Terminology

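On-Policy: Training a model on data it generated itself. Similar to rereading your own writing and correcting it yourself.

Context Distillation: A technique that compresses the ability of a teacher model, which answers while seeing a long context (e.g. experiential knowledge), into the parameters of a student model so that it works without that context.

Reverse KL Divergence: One way to measure the difference between two probability distributions. It pushes the student model to concentrate on the teacher model's core patterns.

Catastrophic Forgetting: Forgetting previously learned knowledge while learning something new. In practice, the problem of general capabilities degrading after fine-tuning.

Trajectory: The full sequence of actions and feedback exchanged between an agent and its environment. Like the complete record of one game.

OOD (Out-of-Distribution): Inputs drawn from a distribution different from the training data. Used here to check whether a model trained on a specific game still answers other kinds of questions well.

Reward-Free: Learning requires no numeric score (reward). The learning signal is built from text feedback alone.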
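
As a worked form of the Reverse KL Divergence entry above, the two directions of KL divergence between a student next-token distribution and a teacher one can be written side by side. The symbols $p_S$, $p_T$, and $v$ (a vocabulary token) are chosen here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\text{forward:}\quad & D_{\mathrm{KL}}(p_T \,\|\, p_S) = \sum_{v} p_T(v)\,\log\frac{p_T(v)}{p_S(v)} \\
\text{reverse:}\quad & D_{\mathrm{KL}}(p_S \,\|\, p_T) = \sum_{v} p_S(v)\,\log\frac{p_S(v)}{p_T(v)}
\end{aligned}
```

The reverse form weights each term by the student's own probability, so the student is penalized for placing mass where the teacher places little. That is why it tends to concentrate on the teacher's dominant patterns rather than spreading over everything the teacher could say.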

Original Abstract

The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.