SkillRL: Evolving LLM Agents through Recursive Skill-Augmented Reinforcement Learning

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Feb 9, 2026 • Peng Xia, Jianwen Chen, Han Wang +10

TL;DR Highlight

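LLM agents automatically extract reusable skills from experience, and the skill library co-evolves with the agent policy during RL training, improving performance by more than 15.3% over existing memory-based methods.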

Who Should Read

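ML engineers who have hit the limits of prompt-based agents such as ReAct and Reflexion and are considering RL-based agent training. Especially relevant for researchers and developers who want agents to accumulate past failures as reusable knowledge.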

Core Mechanics

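Collects failed trajectories as well as successful ones and converts them into 'success patterns' and 'failure lessons' respectively, whereas earlier methods discarded failures.
Organizes skills into SKILLBANK, a hierarchical library split into General Skills (broadly applicable, e.g. exploration strategies) and Task-Specific Skills (detailed per-task procedures), and retrieves them by embedding similarity at inference time.
Distilling skills instead of storing raw trajectories yields 10-20x token compression: better performance with less context.
A recursive evolution mechanism analyzes validation failures during RL training to add new skills or refine existing ones, so the skill library is a living knowledge base rather than a static database.
A cold-start SFT (supervised fine-tuning) stage teaches the base model how to use skills before RL begins; without it, performance drops by 20%.
Built on Qwen2.5-7B-Instruct, it outperforms GPT-4o by 41.9 percentage points and Gemini-2.5-Pro by 29.6 percentage points on ALFWorld.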
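
The two-level library described above could be represented roughly as follows. This is a minimal sketch, not the authors' implementation: the class names `Skill` and `SkillBank`, the `task_type` and `source` fields, the `add` method, and the example skills are all hypothetical; only the four skill fields (id, title, principle, applicability condition) come from this summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Skill:
    """One distilled skill; the first four fields follow the format in this summary."""
    skill_id: str
    title: str                       # short title, 3-5 words
    principle: str                   # 1-2 sentence rule
    when_to_apply: str               # applicability condition
    task_type: Optional[str] = None  # hypothetical: None marks a general skill
    source: str = "success"          # hypothetical: "success" pattern or "failure" lesson


@dataclass
class SkillBank:
    """Hypothetical container that keeps general and task-specific skills apart."""
    general: List[Skill] = field(default_factory=list)
    task_specific: List[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        # Route the skill by whether it is tied to a task type.
        if skill.task_type is None:
            self.general.append(skill)
        else:
            self.task_specific.append(skill)


# Illustrative entries only; these are not skills taken from the paper.
bank = SkillBank()
bank.add(Skill("gen_001", "Explore before acting",
               "Survey reachable locations before committing to a plan.",
               "At the start of an episode."))
bank.add(Skill("pick_001", "Check closed containers first",
               "Open closed containers when the target object is not visible.",
               "When the target object is missing from view.",
               task_type="pick_and_place", source="failure"))
```

Keeping failure lessons in the same structure as success patterns, distinguished only by a tag, is one simple way to let both kinds of skill flow through the same retrieval path.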

Evidence

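ALFWorld success rate of 89.9%: +12.3 percentage points (pp) over base GRPO (77.6%) and +27.4 pp over the strongest memory-augmented RL baseline (SimpleMem+GRPO, 62.5%).
WebShop success rate of 72.7%: +6.6 pp over the best baseline (GRPO, 66.1%).
Average of 47.1% across 7 search-augmented QA tasks: +4.0 pp over EvolveR (43.1%), and +19.4 pp over EvolveR on the single Bamboogle task.
Average prompt length falls from about 1,450 tokens with raw-trajectory memory to about 1,300 tokens, a 10.3% reduction, while performance is higher.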
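
The figures above mix two kinds of comparison: absolute gaps in percentage points and a relative reduction in percent. The short check below recomputes them from the numbers quoted in this section; the helper names `pp_gain` and `relative_reduction` are introduced here for illustration only.

```python
def pp_gain(ours: float, baseline: float) -> float:
    """Absolute gap between two success rates, in percentage points."""
    return round(ours - baseline, 1)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return round(100 * (before - after) / before, 1)


alfworld_vs_grpo = pp_gain(89.9, 77.6)            # ALFWorld vs base GRPO
alfworld_vs_simplemem = pp_gain(89.9, 62.5)       # ALFWorld vs SimpleMem+GRPO
webshop_vs_grpo = pp_gain(72.7, 66.1)             # WebShop vs GRPO
qa_vs_evolver = pp_gain(47.1, 43.1)               # search-augmented QA average vs EvolveR
prompt_reduction = relative_reduction(1450, 1300) # approximate prompt token counts
```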

How to Apply

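During agent rollouts, store both successful and failed trajectories, and have a teacher model such as GPT-4o or o3 extract success patterns and failure causes as JSON skills. Each skill consists of an id, a title (3-5 words), a principle (1-2 sentences), and an applicability condition.
Split the extracted skills into General (shared across all tasks) and Task-Specific (per task type) to build SKILLBANK. At inference time, always include the General skills and retrieve Task-Specific skills by Top-K embedding cosine similarity, then inject them into the context.
During RL training, run validation at a fixed interval (e.g. every 5 steps), gather failed trajectories from task categories whose success rate is below the threshold (δ=0.4), and have the teacher model analyze them to add 1-3 new skills to SKILLBANK.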
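
The third step can be sketched as one evolution pass run after each validation round. This is a sketch under stated assumptions, not the paper's code: `weak_categories`, `parse_new_skills`, `evolve_step`, the `ask_teacher` callback, and the plain-dict skill representation are hypothetical, and the teacher is stubbed out instead of calling a real model API. Only the threshold δ=0.4, the cap of three new skills, and the four skill fields come from this summary.

```python
import json
from typing import Callable, Dict, List, Set

DELTA = 0.4  # validation success-rate threshold quoted in this summary
REQUIRED_KEYS = {"skill_id", "title", "principle", "when_to_apply"}


def weak_categories(val_success: Dict[str, float], delta: float = DELTA) -> List[str]:
    """Task categories whose validation success rate is below delta."""
    return sorted(c for c, rate in val_success.items() if rate < delta)


def parse_new_skills(teacher_output: str, existing_titles: Set[str],
                     max_new: int = 3) -> List[dict]:
    """Parse the teacher's JSON array, dropping malformed or duplicate-title items."""
    try:
        items = json.loads(teacher_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    fresh = [
        item for item in items
        if isinstance(item, dict)
        and REQUIRED_KEYS.issubset(item)
        and item["title"] not in existing_titles
    ]
    return fresh[:max_new]


def evolve_step(skills: List[dict], val_success: Dict[str, float],
                failures: Dict[str, List[str]],
                ask_teacher: Callable[[str, List[str], List[str]], str]) -> List[dict]:
    """Add teacher-proposed skills for every weak category; returns the same list."""
    for category in weak_categories(val_success):
        titles = {s["title"] for s in skills}
        reply = ask_teacher(category, failures.get(category, []), sorted(titles))
        skills.extend(parse_new_skills(reply, titles))
    return skills


def fake_teacher(category: str, failed: List[str], titles: List[str]) -> str:
    # Stand-in for a teacher-model call; returns a fixed illustrative skill.
    return json.dumps([{
        "skill_id": "dyn_001",
        "title": "Verify object location first",
        "principle": "Confirm where the target is before moving toward a receptacle.",
        "when_to_apply": "When a pick-up action fails repeatedly.",
    }])


skills = [{"skill_id": "gen_001", "title": "Explore before acting",
           "principle": "Survey reachable locations first.",
           "when_to_apply": "At the start of an episode."}]
skills = evolve_step(skills, {"pick": 0.25, "clean": 0.8},
                     {"pick": ["<failed trajectory text>"]}, fake_teacher)
```

Validating the teacher's reply before merging it keeps a malformed or repeated suggestion from polluting the library, which matters because the library is injected into every later prompt.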

Code Example

# Core SKILLRL prompt: dynamically discover new skills from failed trajectories

SKILL_DISCOVERY_PROMPT = """
Analyze these failed {env_description} agent trajectories and suggest NEW skills to add.

FAILED TRAJECTORIES:
{failure_examples}

EXISTING SKILL TITLES:
{existing_titles}

Generate 1-3 NEW actionable skills that would help avoid these failures.
Each skill must have: skill_id, title (3-5 words), principle (1-2 sentences), when_to_apply.
The skill_id should follow the pattern: "dyn_001", "dyn_002", etc.

Return ONLY a JSON array of skills, no other text.
"""

# Prompt structure for injecting skills at agent execution time.
# {retrieved_skills} holds the General Skills plus the Top-K Task-Specific Skills.
AGENT_EXECUTION_PROMPT = """
You are an expert agent. Your task is: {task_description}

## Retrieved Relevant Experience
{retrieved_skills}

## Current Progress
{action_history}

Current observation: {current_observation}
Admissible actions: {admissible_actions}

Reason step-by-step inside <think></think>, then output action inside <action></action>.
"""

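# Skill retrieval logic (pseudocode: embed() and cosine_sim() are assumed helpers)
def retrieve_skills(task_description, skillbank, top_k=6, threshold=0.4):
    general_skills = list(skillbank.general)  # always included
    task_emb = embed(task_description)
    # Score each task-specific skill once instead of re-embedding it for the sort
    scored = [(cosine_sim(task_emb, embed(s)), s) for s in skillbank.task_specific]
    # Keep only skills above the similarity threshold, most similar first
    kept = sorted((p for p in scored if p[0] > threshold), key=lambda p: p[0], reverse=True)
    return general_skills + [s for _, s in kept[:top_k]]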
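
The retrieval pseudocode above leaves `embed` and `cosine_sim` undefined. The self-contained variant below fills them in with a toy bag-of-words embedding so the filtering and ranking can be run end to end. The toy embedding, the use of plain strings as skills, and the example skill texts are assumptions for illustration; a real system would presumably use a sentence-embedding model.

```python
import math
from collections import Counter
from types import SimpleNamespace


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': token counts of the lowercased text."""
    return Counter(text.lower().split())


def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_skills(task_description, skillbank, top_k=6, threshold=0.4):
    task_emb = embed(task_description)
    # Score every task-specific skill once.
    scored = [(cosine_sim(task_emb, embed(s)), s) for s in skillbank.task_specific]
    # Drop skills at or below the threshold, then rank by similarity.
    kept = sorted((p for p in scored if p[0] > threshold),
                  key=lambda p: p[0], reverse=True)
    # General skills are always included, followed by the top-k matches.
    return list(skillbank.general) + [s for _, s in kept[:top_k]]


# Illustrative skill texts only; not taken from the paper.
bank = SimpleNamespace(
    general=["explore systematically before acting"],
    task_specific=[
        "heat the object in the microwave before placing it",
        "cool the object in the fridge before placing it",
        "compare product price against the budget",
    ],
)
top_one = retrieve_skills("heat the egg in the microwave", bank, top_k=1)
all_matches = retrieve_skills("heat the egg in the microwave", bank)
```

With this toy embedding the shopping skill scores below the 0.4 threshold and is filtered out, while the two kitchen skills pass and are ranked by similarity.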

Terminology

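GRPO: An RL technique in which the model samples multiple responses and updates the policy using each response's reward relative to its group. It works without a separate critic model, which lowers training cost.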

SFT: Supervised Fine-Tuning, which trains the model to imitate reference examples. It is like learning by studying worked model answers.

trajectory: The full sequence of observations, actions, and rewards exchanged between the agent and the environment. The 'play record' of a single episode.

KL divergence: A measure of how much two probability distributions differ. In RL it serves as a safeguard that keeps the policy from drifting too far from the reference model.

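SKILLBANK: The hierarchical skill library introduced in this paper. It serves as the agent's long-term memory, storing general strategies (General) separately from task-specific know-how (Task-Specific).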

cold-start SFT: A stage before RL training that first teaches the base model, through examples, how to read and use skills. Without it, RL ends up ignoring the skills.

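ALFWorld: An agent benchmark environment in which the agent follows text commands to navigate a virtual house and carry out household tasks such as picking up objects or cleaning.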
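
Two of the terms above, GRPO and KL divergence, are easier to grasp numerically. The sketch below shows group-relative advantage normalization and a discrete KL divergence on small hand-picked inputs. The function names and sample numbers are illustrative, the population standard deviation is one of several conventions in use, and the exact KL term used in SkillRL's training objective is not specified in this summary.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sampled response against its own group, so no critic is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Four rollouts for the same task: two succeeded (reward 1), two failed (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical action distributions for the current policy and the reference model.
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]
drift = kl_divergence(policy, reference)
```

Successful rollouts receive positive advantages and failed ones negative, and a group where every rollout earns the same reward yields all-zero advantages, so it contributes no learning signal. The KL value grows as the policy's distribution moves away from the reference, which is what makes it usable as a penalty.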

Related Resources

https://github.com/aiming-lab/SkillRL

Original Abstract

Large Language Model (LLM) agents have shown stunning results in complex tasks, yet they often operate in isolation, failing to learn from past experiences. Existing memory-based methods primarily store raw trajectories, which are often redundant and noise-heavy. This prevents agents from extracting high-level, reusable behavioral patterns that are essential for generalization. In this paper, we propose SkillRL, a framework that bridges the gap between raw experience and policy improvement through automatic skill discovery and recursive evolution. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library SkillBank, an adaptive retrieval strategy for general and task-specific heuristics, and a recursive evolution mechanism that allows the skill library to co-evolve with the agent's policy during reinforcement learning. These innovations significantly reduce the token footprint while enhancing reasoning utility. Experimental results on ALFWorld, WebShop and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance, outperforming strong baselines over 15.3% and maintaining robustness as task complexity increases. Code is available at https://github.com/aiming-lab/SkillRL.