Memento-Skills: Let Agents Design Agents
TL;DR Highlight
A system where agents self-evolve by accumulating executable 'Skill' files as external memory, without touching LLM parameters
Who Should Read
ML engineers and agent-system developers deploying LLM agents to production who need continual learning or per-task performance gains. Especially useful for teams that want to keep improving agent capabilities without fine-tuning.
Core Mechanics
- Agents grow through experience using markdown-based 'Skills' as external memory, with zero modification to LLM weights/parameters
- Implements continual learning via a Read (retrieve relevant Skills) → Act (execute) → Feedback (evaluate results) → Write (update Skills) loop
- Uses a Memento-Qwen router trained on actual execution success/failure rather than simple cosine similarity — an 87.5% relative improvement in Recall@1 over BM25
- On failure, the LLM automatically identifies which Skill caused the issue (failure attribution) and directly edits or creates new Skill files
- Achieved 26.2% relative improvement in overall accuracy on GAIA benchmark after 3 rounds, and 116.2% relative improvement on Humanity's Last Exam (HLE)
- Cross-task Skill transfer is the key driver — the effect is maximized when domain categories are clear, like in HLE (e.g., Biology Skills reused for unseen Biology questions)
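The Read → Act → Feedback → Write loop above can be sketched in a few lines. Everything here is a toy stand-in: the keyword-overlap retrieval replaces the trained Memento-Qwen router, and the `act`/`judge`/`revise` callables replace the LLM-driven execution and write phase; none of these names come from the actual Memento-Skills API.

```python
class SkillLibrary:
    """Toy external memory: Skill name -> markdown body."""
    def __init__(self, skills):
        self.skills = dict(skills)

    def retrieve(self, task):
        # Read: toy router — keyword overlap instead of the trained
        # Memento-Qwen router described above.
        def score(item):
            name, body = item
            return len(set(task.lower().split()) & set(body.lower().split()))
        return max(self.skills.items(), key=score)

    def write(self, name, body):
        # Write: create or overwrite a Skill file after feedback.
        self.skills[name] = body

def step(task, library, act, judge, revise):
    name, skill = library.retrieve(task)   # Read
    result = act(task, skill)              # Act
    ok = judge(task, result)               # Feedback
    if not ok:                             # Write (failure attribution)
        library.write(name, revise(skill, result))
    return ok
```

Running one failed step appends a "lesson" to the retrieved Skill, which is the mechanism the bullets describe: adaptation lives entirely in the Skill files, not in model weights.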
Evidence
- GAIA test set: Memento-Skills 66.0% vs Read-Write baseline 52.3% — 13.7pp gap
- HLE test set: Memento-Skills 38.7% vs Read-Write baseline 17.9% — more than 2x difference
- Memento-Qwen router: Recall@1 of 0.60 (BM25 0.32, Qwen3-Embedding 0.54) — 11% relative improvement over strongest baseline
- Skill library grew from 5 initial atomic skills to 235 after HLE training, and 41 after GAIA training
How to Apply
- Attach a Skill folder (a SKILL.md plus helper scripts) as external memory to your existing agent, and add a Write loop in which the LLM auto-edits Skills on each success/failure feedback — agent performance improves incrementally without fine-tuning
- For domain-specific tasks (e.g., industry customer support, code review, medical QA), cluster Skill libraries by domain to maximize cross-task transfer effects
- Don't implement your router with semantic embeddings alone — fine-tune with InfoNCE loss using actual execution success/failure as labels to dramatically improve hard negative discrimination (distinguishing 'looks similar but wrong' Skills)
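The InfoNCE objective mentioned in the last bullet, sketched in plain Python with dot-product similarity. Embeddings are bare float lists and the function name is ours, not from the Memento-Skills codebase; in practice the positive is a Skill that actually succeeded on the query and the negatives are same-domain Skills that did not.

```python
import math

def info_nce(query, positive, negatives, temperature=0.07):
    # Pull the Skill that succeeded (positive) toward the query,
    # push hard negatives ("looks similar but wrong") away.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive) / temperature] + \
             [dot(query, n) / temperature for n in negatives]
    # Numerically stable -log softmax of the positive logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is near zero when the query embedding already matches the successful Skill, and grows as a hard negative scores higher than the positive — exactly the discrimination signal the bullet says plain semantic embeddings lack.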
Code Example
# Install and run Memento-Skills
git clone https://github.com/Memento-Teams/Memento-Skills.git
cd Memento-Skills
python -m venv .venv && source .venv/bin/activate
pip install -e .
memento agent
# config.json configuration example
{
"llm": {
"active_profile": "default",
"profiles": {
"default": {
"model": "your-provider/your-model",
"api_key": "your-api-key",
"base_url": "https://your-api-url/v1"
}
}
},
"env": {
"TAVILY_API_KEY": "your-search-api-key"
}
}
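The paper stores each Skill as a structured markdown file alongside helper scripts, but the exact schema is not reproduced in this summary, so the layout below is purely illustrative:

```markdown
skills/csv_analysis/          # hypothetical Skill folder
├── SKILL.md                  # description, when-to-use, pitfalls
└── helper.py                 # executable helper the agent can call

# SKILL.md (illustrative structure)
## Name
csv_analysis
## When to use
Tasks that require loading and aggregating tabular data.
## Steps
1. Load the file with helper.py
2. Aggregate and report results
## Lessons learned
- (appended by the Write phase after failures)
```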
# Synthetic Query generation prompt for router training (based on Appendix C)
# positive query: scenario where the target Skill should be selected
# hard negative: scenario in the same domain but where this Skill is not the best choice
prompt = """
Target skill:
- name: {skill_name}
- description: {description}
Generate:
- {n_pos} positive queries: target skill SHOULD be selected
- {n_neg} hard negative queries: same domain BUT skill is NOT the best tool
Return JSON: {{"positive_queries": [...], "negative_queries": [...]}}
"""
# Note: braces in the JSON schema are doubled so str.format() leaves them literal.
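Assuming the model returns valid JSON, a driver for this prompt might fill the template and parse the reply into (query, skill, label) training pairs for the router. Everything here is hypothetical: `call_llm` is a placeholder for your provider's completion call, and the template restates the prompt above with braces escaped for `str.format`.

```python
import json

def make_router_pairs(skill_name, description, call_llm, n_pos=4, n_neg=4):
    # Template braces around the JSON schema are doubled so that
    # str.format() leaves them literal.
    template = (
        "Target skill:\n"
        "- name: {skill_name}\n"
        "- description: {description}\n"
        "Generate:\n"
        "- {n_pos} positive queries: target skill SHOULD be selected\n"
        "- {n_neg} hard negative queries: same domain BUT skill is NOT the best tool\n"
        'Return JSON: {{"positive_queries": [...], "negative_queries": [...]}}'
    )
    reply = call_llm(template.format(skill_name=skill_name,
                                     description=description,
                                     n_pos=n_pos, n_neg=n_neg))
    data = json.loads(reply)
    # Label 1 = Skill should be selected, 0 = hard negative.
    pairs = [(q, skill_name, 1) for q in data["positive_queries"]]
    pairs += [(q, skill_name, 0) for q in data["negative_queries"]]
    return pairs
```

These labeled pairs are what the InfoNCE fine-tuning described under "How to Apply" would consume.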
Original Abstract
We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 (wang2025memento2). In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.