XSkill: Continual Learning from Experience and Skills in Multimodal Agents
TL;DR Highlight
A multimodal agent that keeps getting smarter on its own by accumulating two types of parameter-free memory: past experiences (action-level) and skills (task-level).
Who Should Read
Researchers building long-horizon multimodal agents, and teams wanting continual improvement without retraining. Anyone working on autonomous agents that need to generalize from accumulated experience.
Core Mechanics
- Proposes XSkill, a dual-stream framework that lets multimodal agents self-improve through two levels of parameter-free memory: action-level experience memory and task-level skill memory
- Action-level experience memory stores successful past action sequences with context, allowing retrieval of similar experiences for new tasks
- Task-level skill memory abstracts successful experiences into reusable skill templates that generalize across related tasks
- The two memory systems work synergistically — experience memory provides concrete examples while skill memory provides abstract strategies
- Memory is updated continuously without any parameter updates, making it a truly training-free continual learning approach
- Works with any backbone multimodal LLM without modification
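The two memory levels can be pictured as simple parameter-free stores that grow during operation. The sketch below is illustrative only: the class and field names (`Experience`, `Skill`, `AgentMemory`, `record_success`) are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    """Action-level memory: one concrete, successful action with its context."""
    condition: str   # situation in which the action worked
    action: str      # the action sequence that succeeded

@dataclass
class Skill:
    """Task-level memory: an abstract strategy distilled from experiences."""
    name: str
    steps: list[str]                                      # reusable plan template
    source_ids: list[int] = field(default_factory=list)  # experiences it was distilled from

@dataclass
class AgentMemory:
    """Two-level memory; updated continuously, no parameter updates."""
    experiences: list[Experience] = field(default_factory=list)
    skills: list[Skill] = field(default_factory=list)

    def record_success(self, condition: str, action: str) -> int:
        """Append a successful trajectory and return its index for provenance."""
        self.experiences.append(Experience(condition, action))
        return len(self.experiences) - 1

memory = AgentMemory()
eid = memory.record_success("image rotated 180 degrees", "rotate before analysis")
memory.skills.append(Skill("normalize-input",
                           ["detect orientation", "rotate if needed"],
                           source_ids=[eid]))
```

Because both stores are plain data rather than model weights, "learning" reduces to appending and consolidating entries, which is why any backbone model can be swapped in unchanged.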
Evidence
- Consistently and substantially outperforms both tool-only and learning-based baselines on five benchmarks across diverse domains, with four backbone models
- Performance continues to improve as the agent accumulates more experience over time
- Skill memory abstraction enables zero-shot transfer to unseen but related tasks
- Matches or beats methods that require fine-tuning, despite being entirely training-free
How to Apply
- Plug XSkill on top of any existing multimodal LLM to get continual self-improvement at zero training cost
- The action-level memory is populated automatically during agent operation — no manual curation needed
- Most effective for long-horizon tasks where strategy reuse across episodes provides the biggest gains
Code Example
# Example: retrieving stored experiences for a new task (core logic sketch)
import json

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    res = client.embeddings.create(model="text-embedding-3-small", input=text)
    return res.data[0].embedding

def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Experience bank: each entry pairs a triggering condition with a proven action
experience_bank = [
    {"id": "E1", "condition": "When image is upside-down",
     "action": "Rotate image before analysis using img.rotate(180)",
     "embedding": embed("When image is upside-down rotate image before analysis")},
    {"id": "E2", "condition": "When object is too small to identify",
     "action": "Crop and zoom in with code interpreter before searching",
     "embedding": embed("When object is too small crop and zoom in")},
]

def retrieve_experiences(query: str, top_k: int = 3, threshold: float = 0.0):
    q_emb = embed(query)
    scored = [(e, cosine_sim(q_emb, e["embedding"])) for e in experience_bank]
    scored = [(e, s) for e, s in scored if s > threshold]
    scored.sort(key=lambda x: -x[1])
    return [e for e, _ in scored[:top_k]]

def decompose_task(task_description: str, image_context: str) -> list[str]:
    """Decompose a task into 2-3 abstract subtask queries for retrieval."""
    prompt = f"""Decompose this visual task into 2-3 abstract subtask queries for experience retrieval.
Task: {task_description}
Image context: {image_context}
Output JSON array of query strings only."""
    res = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return json.loads(res.choices[0].message.content)

# Usage example
task = "What is the prototype of the two mascots in the corner of the picture?"
image_ctx = "Image appears to be rotated 180 degrees, small mascots in corner"
subtask_queries = decompose_task(task, image_ctx)

all_experiences = []
for q in subtask_queries:
    all_experiences.extend(retrieve_experiences(q, top_k=3))

# Deduplicate by id and inject into the system prompt
unique_exps = {e["id"]: e for e in all_experiences}.values()
injection = "\n".join(f"[{e['id']}] {e['condition']}: {e['action']}" for e in unique_exps)
print("Injected experiences:\n", injection)
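The retrieval code above covers the experience stream; the skill stream consolidates several successful experiences into a task-level template and closes the continual-learning loop by logging usage. The sketch below is hypothetical: `consolidate_skill`, `apply_skill`, and the injected `summarize` callable are illustrative names, with `summarize` standing in for the paper's LLM-based, visually grounded summarization and cross-rollout critique.

```python
from typing import Callable

def consolidate_skill(experiences: list[dict],
                      summarize: Callable[[list[str]], str]) -> dict:
    """Distill several successful experiences into one reusable skill template.

    `summarize` abstracts concrete traces into a generic strategy; in practice
    this would be an LLM call, injected here so the loop itself is testable.
    """
    traces = [f"{e['condition']}: {e['action']}" for e in experiences]
    return {
        "template": summarize(traces),               # abstract, task-level strategy
        "sources": [e["id"] for e in experiences],   # provenance for later critique
        "uses": 0,                                   # usage history fed back into accumulation
    }

def apply_skill(skill: dict) -> str:
    """At inference, reuse the template and record usage for the feedback loop."""
    skill["uses"] += 1
    return skill["template"]

# Stub summarizer for illustration; a real agent would call an LLM here
stub = lambda traces: "Normalize the image (orientation/scale) before analysis"
skill = consolidate_skill(
    [{"id": "E1", "condition": "image upside-down", "action": "rotate 180"},
     {"id": "E2", "condition": "object too small", "action": "crop and zoom"}],
    summarize=stub,
)
plan = apply_skill(skill)
```

Keeping provenance (`sources`) and a usage counter on each skill is what lets retrieval outcomes flow back into accumulation, the feedback loop the abstract describes.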
Original Abstract
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.