A Survey on the Memory Mechanism of Large Language Model-based Agents
TL;DR Highlight
The first memory-specialized survey paper comprehensively covering how LLM agents store and use memory.
Who Should Read
Backend and AI developers designing or building LLM agents, especially anyone working on conversation-context maintenance or long-term memory implementation. Also useful for teams considering multi-agent systems or personal-assistant development.
Core Mechanics
- Classifies memory sources into 3 types: information within the current task (inside-trial), experience from past tasks (cross-trial), and external knowledge (Wikipedia, APIs, etc.)
- Memory storage forms: Textual (natural language) and Parametric (embedded in parameters). Text is easy to interpret and modify but hits context length limits; parametric is dense but expensive to update.
- 3 stages of memory operation: Writing (store info) → Management (summarize/merge/forget) → Reading (retrieve/use). Generative Agents create high-level memories via reflection; Reflexion uses failure experiences as language-based reinforcement.
- Most current systems rely primarily on textual memory. A variety of retrieval strategies are in use: FAISS vector search, SQL-based search, LSH, etc.
- Systems like Reflexion, ExpeL, Synapse improve performance by using cross-trial info (past failure/success experience) to reduce repeating the same mistakes.
- Memory evaluation proceeds on two tracks: direct (accuracy, F1, response consistency) and indirect (conversation completeness, multi-source QA, long-context task success rate).
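The Writing → Management → Reading pipeline above can be sketched without any LLM machinery. In this toy (every name and the overlap-scoring rule are illustrative, not from the survey), management is plain recency-based forgetting and reading is keyword-overlap retrieval:

```python
from collections import deque

class TextualMemory:
    """Toy Writing -> Management -> Reading pipeline for textual agent memory."""

    def __init__(self, capacity=5):
        # Management here is simple recency-based forgetting:
        # once full, the oldest entry is dropped automatically.
        self.entries = deque(maxlen=capacity)

    def write(self, observation: str):
        # Writing: store a new observation verbatim
        self.entries.append(observation)

    def read(self, query: str, top_k=2):
        # Reading: score stored entries by word overlap with the query
        q = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(q & set(e.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

mem = TextualMemory(capacity=3)
for obs in ["door is locked", "key under the mat", "lamp is on", "cat on the sofa"]:
    mem.write(obs)  # "door is locked" is forgotten once capacity is exceeded

print(mem.read("where is the key to the door"))
# -> ['key under the mat', 'lamp is on']
```

Real systems replace recency-based forgetting with summarization or merging and word overlap with embedding similarity, but the three-stage contract stays the same.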
Evidence
- Voyager (a Minecraft agent) uses cross-trial skill memory to discover meaningfully more items during exploration than the baseline (ablation comparison in the paper)
- Reflexion: AlfWorld task success rate improves significantly over a no-memory baseline (improvements of 20%+ over ReAct reported)
- ExpeL extracts patterns from failed and successful trajectories and applies them to new tasks, with confirmed performance improvements in complex interactive tasks like AlfWorld
- MemGPT borrows the OS virtual-memory concept to maintain context in long conversations, with reported improvements in user engagement (CSIM score)
How to Apply
- When building conversation agents: instead of dumping the entire history into the prompt, store conversation segments as embeddings in a vector DB such as FAISS and retrieve only the top-K most similar to the current question. This keeps context length bounded.
- When an agent fails: instead of blindly retrying, use Reflexion-style cross-trial memory to summarize the failure cause in natural language and include it in the next attempt's prompt, reducing repeats of the same mistake.
- For domain-specific agents (medical, financial): combine external knowledge (Wikipedia, specialized DBs) fetched via dynamic API calls with core domain knowledge fine-tuned into the parameters via LoRA. This provides both recency and expertise.
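The external/parametric split in the last bullet can be mocked in a few lines. Here volatile facts are served from an external store (a plain dict standing in for a live API or specialized DB), and everything else falls through to the fine-tuned model (a placeholder function); all names and values are illustrative:

```python
# Stands in for a live API or specialized DB: holds volatile, fast-changing facts.
EXTERNAL_KB = {
    "latest usd/krw rate": "1350.00 (illustrative placeholder)",
}

def call_finetuned_model(query: str) -> str:
    # Placeholder for a LoRA-fine-tuned model carrying stable domain expertise.
    return f"[model answer for: {query}]"

def answer(query: str) -> str:
    key = query.lower().strip("?")
    if key in EXTERNAL_KB:
        # Recency: volatile facts come from the external store
        return EXTERNAL_KB[key]
    # Expertise: stable knowledge comes from the parametric model
    return call_finetuned_model(query)

print(answer("Latest USD/KRW rate?"))
# -> 1350.00 (illustrative placeholder)
print(answer("what is duration risk"))
# -> [model answer for: what is duration risk]
```

In production the dict lookup becomes a retrieval or API-routing step, but the decision structure (external for freshness, parametric for depth) is the same.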
Code Example
# Reflexion-style cross-trial memory pattern example
from openai import OpenAI

client = OpenAI()

# Memory store that saves failure experiences in natural language
cross_trial_memory = []

def reflect_on_failure(task, failed_trajectory, error_feedback):
    """Analyze failure experiences to extract lessons learned"""
    prompt = f"""
Task: {task}
What I tried: {failed_trajectory}
What went wrong: {error_feedback}
In 2-3 sentences, what should I do differently next time?
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def run_agent_with_memory(task, max_trials=3):
    for trial in range(max_trials):
        # Include past failure experiences in the prompt
        memory_context = ""
        if cross_trial_memory:
            memory_context = "Past experiences:\n" + "\n".join(
                f"- {m}" for m in cross_trial_memory[-3:]  # only the last 3 lessons
            )
        prompt = f"""
{memory_context}
Now solve this task: {task}
"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        result = response.choices[0].message.content
        # On failure, reflect and save the lesson to cross-trial memory
        success = evaluate_result(result)  # supply task-specific evaluation logic
        if not success:
            lesson = reflect_on_failure(task, result, "Task failed")
            cross_trial_memory.append(lesson)
            print(f"Trial {trial + 1} failed. Learned: {lesson}")
        else:
            return result
    return "Max trials reached"
# Retrieved-memory pattern (using FAISS)
import faiss
import numpy as np

class AgentMemory:
    def __init__(self, embedding_dim=1536):
        self.index = faiss.IndexFlatL2(embedding_dim)
        self.memories = []

    def write(self, text, embedding):
        """Save a memory and its embedding"""
        # FAISS expects a 2-D float32 array
        self.index.add(np.asarray([embedding], dtype="float32"))
        self.memories.append(text)

    def read(self, query_embedding, top_k=3):
        """Retrieve the top-k most relevant memories"""
        distances, indices = self.index.search(
            np.asarray([query_embedding], dtype="float32"), top_k
        )
        # FAISS pads with -1 when fewer than top_k entries exist
        return [self.memories[i] for i in indices[0] if 0 <= i < len(self.memories)]
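If you want the same write/read interface without a FAISS dependency, brute-force L2 search in NumPy behaves identically at small scale. The class name and the toy 2-D embeddings below are illustrative, not from the survey:

```python
import numpy as np

class NumpyMemory:
    """Same write/read interface as a FAISS-backed memory, but using
    brute-force L2 search in NumPy (fine for small memory stores)."""

    def __init__(self):
        self.vectors = []
        self.memories = []

    def write(self, text, embedding):
        self.vectors.append(np.asarray(embedding, dtype="float32"))
        self.memories.append(text)

    def read(self, query_embedding, top_k=3):
        q = np.asarray(query_embedding, dtype="float32")
        # Squared L2 distance to every stored vector
        dists = [float(np.sum((v - q) ** 2)) for v in self.vectors]
        order = np.argsort(dists)[:top_k]
        return [self.memories[i] for i in order]

mem = NumpyMemory()
mem.write("user prefers Korean", [1.0, 0.0])
mem.write("user is vegetarian", [0.0, 1.0])
mem.write("user lives in Seoul", [0.9, 0.1])
print(mem.read([1.0, 0.0], top_k=2))
# -> ['user prefers Korean', 'user lives in Seoul']
```

Swapping this for the FAISS version only matters once the store grows beyond a few thousand entries, where an index pays off.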
Original Abstract
Large language model (LLM)-based agents have recently attracted much attention from the research and industry communities. Compared with original LLMs, LLM-based agents are featured in their self-evolving capability, which is the basis for solving real-world problems that need long-term and complex agent-environment interactions. The key component to support agent-environment interactions is the memory of the agents. While previous studies have proposed many promising memory mechanisms, they are scattered in different papers, and there lacks a systematical review to summarize and compare these works from a holistic perspective, failing to abstract common and effective designing patterns for inspiring future studies. To bridge this gap, in this article, we propose a comprehensive survey on the memory mechanism of LLM-based agents. In specific, we first discuss “what is” and “why do we need” the memory in LLM-based agents. Then, we systematically review previous studies on how to design and evaluate the memory module. In addition, we also present many agent applications, where the memory module plays an important role. At last, we analyze the limitations of existing work and show important future directions. To keep up with the latest advances in this field, we create a repository at https://github.com/nuster1128/LLM_Agent_Memory_Survey.