XSkill: Continual Learning from Experience and Skills in Multimodal Agents
TL;DR Highlight
A multimodal agent that keeps getting smarter on its own by accumulating two types of parameter-free memory: past experiences (action-level) and skills (task-level).
Who Should Read
Researchers building long-horizon multimodal agents, and teams wanting continual improvement without retraining. Anyone working on autonomous agents that need to generalize from accumulated experience.
Core Mechanics
- Proposes XSkill, a dual-stream framework that lets multimodal agents self-improve through two levels of parameter-free memory: action-level experience memory and task-level skill memory
- Action-level experience memory stores successful past action sequences with context, allowing retrieval of similar experiences for new tasks
- Task-level skill memory abstracts successful experiences into reusable skill templates that generalize across related tasks
- The two memory systems work synergistically — experience memory provides concrete examples while skill memory provides abstract strategies
- Memory is updated continuously without any parameter updates, making it a truly training-free continual learning approach
- Works with any backbone multimodal LLM without modification
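The two memory levels can be pictured as simple parameter-free stores that grow during operation. The sketch below is illustrative only: the class and field names (`Experience`, `Skill`, `AgentMemory`, `record_success`) are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    """Action-level memory: one concrete, successful action with its context."""
    condition: str   # situation in which the action worked
    action: str      # the action sequence that succeeded

@dataclass
class Skill:
    """Task-level memory: an abstract strategy distilled from experiences."""
    name: str
    steps: list[str]                                      # reusable plan template
    source_ids: list[int] = field(default_factory=list)  # experiences it was distilled from

@dataclass
class AgentMemory:
    """Two-level memory; updated continuously, no parameter updates."""
    experiences: list[Experience] = field(default_factory=list)
    skills: list[Skill] = field(default_factory=list)

    def record_success(self, condition: str, action: str) -> int:
        """Append a successful trajectory and return its index for provenance."""
        self.experiences.append(Experience(condition, action))
        return len(self.experiences) - 1

memory = AgentMemory()
eid = memory.record_success("image rotated 180 degrees", "rotate before analysis")
memory.skills.append(Skill("normalize-input",
                           ["detect orientation", "rotate if needed"],
                           source_ids=[eid]))
```

Because both stores are plain data rather than model weights, "learning" reduces to appending and consolidating entries, which is why any backbone model can be swapped in unchanged.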
Evidence
- Consistently and substantially outperforms both tool-only and learning-based baselines on five benchmarks across diverse domains, with four backbone models
- Performance continues to improve as the agent accumulates more experience over time
- Skill memory abstraction enables zero-shot transfer to unseen but related tasks
- Matches or beats methods that require fine-tuning, despite being entirely training-free
How to Apply
- Plug XSkill on top of any existing multimodal LLM to get continual self-improvement at zero training cost
- The action-level memory is populated automatically during agent operation — no manual curation needed
- Most effective for long-horizon tasks where strategy reuse across episodes provides the biggest gains
Code Example
# Example: retrieving stored experiences for a new task (core logic sketch)
import json

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    res = client.embeddings.create(model="text-embedding-3-small", input=text)
    return res.data[0].embedding

def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Experience bank: each entry pairs a triggering condition with a proven action
experience_bank = [
    {"id": "E1", "condition": "When image is upside-down",
     "action": "Rotate image before analysis using img.rotate(180)",
     "embedding": embed("When image is upside-down rotate image before analysis")},
    {"id": "E2", "condition": "When object is too small to identify",
     "action": "Crop and zoom in with code interpreter before searching",
     "embedding": embed("When object is too small crop and zoom in")},
]

def retrieve_experiences(query: str, top_k: int = 3, threshold: float = 0.0):
    q_emb = embed(query)
    scored = [(e, cosine_sim(q_emb, e["embedding"])) for e in experience_bank]
    scored = [(e, s) for e, s in scored if s > threshold]
    scored.sort(key=lambda x: -x[1])
    return [e for e, _ in scored[:top_k]]

def decompose_task(task_description: str, image_context: str) -> list[str]:
    """Decompose a task into 2-3 abstract subtask queries for retrieval."""
    prompt = f"""Decompose this visual task into 2-3 abstract subtask queries for experience retrieval.
Task: {task_description}
Image context: {image_context}
Output JSON array of query strings only."""
    res = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return json.loads(res.choices[0].message.content)

# Usage example
task = "What is the prototype of the two mascots in the corner of the picture?"
image_ctx = "Image appears to be rotated 180 degrees, small mascots in corner"
subtask_queries = decompose_task(task, image_ctx)

all_experiences = []
for q in subtask_queries:
    all_experiences.extend(retrieve_experiences(q, top_k=3))

# Deduplicate by id and inject into the system prompt
unique_exps = {e["id"]: e for e in all_experiences}.values()
injection = "\n".join(f"[{e['id']}] {e['condition']}: {e['action']}" for e in unique_exps)
print("Injected experiences:\n", injection)
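The retrieval code above covers the experience stream; the skill stream consolidates several successful experiences into a task-level template and closes the continual-learning loop by logging usage. The sketch below is hypothetical: `consolidate_skill`, `apply_skill`, and the injected `summarize` callable are illustrative names, with `summarize` standing in for the paper's LLM-based, visually grounded summarization and cross-rollout critique.

```python
from typing import Callable

def consolidate_skill(experiences: list[dict],
                      summarize: Callable[[list[str]], str]) -> dict:
    """Distill several successful experiences into one reusable skill template.

    `summarize` abstracts concrete traces into a generic strategy; in practice
    this would be an LLM call, injected here so the loop itself is testable.
    """
    traces = [f"{e['condition']}: {e['action']}" for e in experiences]
    return {
        "template": summarize(traces),               # abstract, task-level strategy
        "sources": [e["id"] for e in experiences],   # provenance for later critique
        "uses": 0,                                   # usage history fed back into accumulation
    }

def apply_skill(skill: dict) -> str:
    """At inference, reuse the template and record usage for the feedback loop."""
    skill["uses"] += 1
    return skill["template"]

# Stub summarizer for illustration; a real agent would call an LLM here
stub = lambda traces: "Normalize the image (orientation/scale) before analysis"
skill = consolidate_skill(
    [{"id": "E1", "condition": "image upside-down", "action": "rotate 180"},
     {"id": "E2", "condition": "object too small", "action": "crop and zoom"}],
    summarize=stub,
)
plan = apply_skill(skill)
```

Keeping provenance (`sources`) and a usage counter on each skill is what lets retrieval outcomes flow back into accumulation, the feedback loop the abstract describes.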
Original Abstract
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.