Memento-Skills: Let Agents Design Agents
TL;DR Highlight
A system where agents self-evolve by accumulating executable 'Skill' files as external memory, without touching LLM parameters
Who Should Read
ML engineers and agent-system developers deploying LLM agents to production who need continual learning or per-task performance gains. Especially useful for teams that want to keep improving agent capabilities without fine-tuning.
Core Mechanics
- Agents grow through experience using markdown-based 'Skills' as external memory, with zero modification to LLM weights/parameters
- Implements continual learning via a Read (retrieve relevant Skills) → Act (execute) → Feedback (evaluate results) → Write (update Skills) loop
- Uses a Memento-Qwen router trained on actual execution success/failure rather than simple cosine similarity — an 87.5% relative improvement in Recall@1 over BM25
- On failure, the LLM automatically identifies which Skill caused the issue (failure attribution) and directly edits or creates new Skill files
- Achieved 26.2% relative improvement in overall accuracy on GAIA benchmark after 3 rounds, and 116.2% relative improvement on Humanity's Last Exam (HLE)
- Cross-task Skill transfer is the key driver — the effect is maximized when domain categories are clear, like in HLE (e.g., Biology Skills reused for unseen Biology questions)
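The Read → Act → Feedback → Write loop above can be sketched in a few lines. Everything here is a toy stand-in: the keyword-overlap retrieval replaces the trained Memento-Qwen router, and the `act`/`judge`/`revise` callables replace the LLM-driven execution and write phase; none of these names come from the actual Memento-Skills API.

```python
class SkillLibrary:
    """Toy external memory: Skill name -> markdown body."""
    def __init__(self, skills):
        self.skills = dict(skills)

    def retrieve(self, task):
        # Read: toy router — keyword overlap instead of the trained
        # Memento-Qwen router described above.
        def score(item):
            name, body = item
            return len(set(task.lower().split()) & set(body.lower().split()))
        return max(self.skills.items(), key=score)

    def write(self, name, body):
        # Write: create or overwrite a Skill file after feedback.
        self.skills[name] = body

def step(task, library, act, judge, revise):
    name, skill = library.retrieve(task)   # Read
    result = act(task, skill)              # Act
    ok = judge(task, result)               # Feedback
    if not ok:                             # Write (failure attribution)
        library.write(name, revise(skill, result))
    return ok
```

Running one failed step appends a "lesson" to the retrieved Skill, which is the mechanism the bullets describe: adaptation lives entirely in the Skill files, not in model weights.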
Evidence
- GAIA test set: Memento-Skills 66.0% vs Read-Write baseline 52.3% — 13.7pp gap
- HLE test set: Memento-Skills 38.7% vs Read-Write baseline 17.9% — more than 2x difference
- Memento-Qwen router: Recall@1 of 0.60 (BM25 0.32, Qwen3-Embedding 0.54) — 11% relative improvement over strongest baseline
- Skill library grew from 5 initial atomic skills to 235 after HLE training, and 41 after GAIA training
How to Apply
- Attach a Skill folder (a SKILL.md plus helper scripts) as external memory to your existing agent, and add a Write loop in which the LLM auto-edits Skills on each success/failure feedback — agent performance improves incrementally without fine-tuning
- For domain-specific tasks (e.g., industry customer support, code review, medical QA), cluster Skill libraries by domain to maximize cross-task transfer effects
- Don't implement your router with semantic embeddings alone — fine-tune with InfoNCE loss using actual execution success/failure as labels to dramatically improve hard negative discrimination (distinguishing 'looks similar but wrong' Skills)
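The InfoNCE objective mentioned in the last bullet, sketched in plain Python with dot-product similarity. Embeddings are bare float lists and the function name is ours, not from the Memento-Skills codebase; in practice the positive is a Skill that actually succeeded on the query and the negatives are same-domain Skills that did not.

```python
import math

def info_nce(query, positive, negatives, temperature=0.07):
    # Pull the Skill that succeeded (positive) toward the query,
    # push hard negatives ("looks similar but wrong") away.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive) / temperature] + \
             [dot(query, n) / temperature for n in negatives]
    # Numerically stable -log softmax of the positive logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

The loss is near zero when the query embedding already matches the successful Skill, and grows as a hard negative scores higher than the positive — exactly the discrimination signal the bullet says plain semantic embeddings lack.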
Code Example
# Install and run Memento-Skills
git clone https://github.com/Memento-Teams/Memento-Skills.git
cd Memento-Skills
python -m venv .venv && source .venv/bin/activate
pip install -e .
memento agent
# config.json configuration example
{
"llm": {
"active_profile": "default",
"profiles": {
"default": {
"model": "your-provider/your-model",
"api_key": "your-api-key",
"base_url": "https://your-api-url/v1"
}
}
},
"env": {
"TAVILY_API_KEY": "your-search-api-key"
}
}
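The paper stores each Skill as a structured markdown file alongside helper scripts, but the exact schema is not reproduced in this summary, so the layout below is purely illustrative:

```markdown
skills/csv_analysis/          # hypothetical Skill folder
├── SKILL.md                  # description, when-to-use, pitfalls
└── helper.py                 # executable helper the agent can call

# SKILL.md (illustrative structure)
## Name
csv_analysis
## When to use
Tasks that require loading and aggregating tabular data.
## Steps
1. Load the file with helper.py
2. Aggregate and report results
## Lessons learned
- (appended by the Write phase after failures)
```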
# Synthetic Query generation prompt for router training (based on Appendix C)
# positive query: scenario where the target Skill should be selected
# hard negative: scenario in the same domain but where this Skill is not the best choice
prompt = """
Target skill:
- name: {skill_name}
- description: {description}
Generate:
- {n_pos} positive queries: target skill SHOULD be selected
- {n_neg} hard negative queries: same domain BUT skill is NOT the best tool
Return JSON: {{"positive_queries": [...], "negative_queries": [...]}}
"""
# Note: braces in the JSON schema are doubled so str.format() leaves them literal.
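Assuming the model returns valid JSON, a driver for this prompt might fill the template and parse the reply into (query, skill, label) training pairs for the router. Everything here is hypothetical: `call_llm` is a placeholder for your provider's completion call, and the template restates the prompt above with braces escaped for `str.format`.

```python
import json

def make_router_pairs(skill_name, description, call_llm, n_pos=4, n_neg=4):
    # Template braces around the JSON schema are doubled so that
    # str.format() leaves them literal.
    template = (
        "Target skill:\n"
        "- name: {skill_name}\n"
        "- description: {description}\n"
        "Generate:\n"
        "- {n_pos} positive queries: target skill SHOULD be selected\n"
        "- {n_neg} hard negative queries: same domain BUT skill is NOT the best tool\n"
        'Return JSON: {{"positive_queries": [...], "negative_queries": [...]}}'
    )
    reply = call_llm(template.format(skill_name=skill_name,
                                     description=description,
                                     n_pos=n_pos, n_neg=n_neg))
    data = json.loads(reply)
    # Label 1 = Skill should be selected, 0 = hard negative.
    pairs = [(q, skill_name, 1) for q in data["positive_queries"]]
    pairs += [(q, skill_name, 0) for q in data["negative_queries"]]
    return pairs
```

These labeled pairs are what the InfoNCE fine-tuning described under "How to Apply" would consume.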
Original Abstract
We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 (wang2025memento2). In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.