A Survey on the Memory Mechanism of Large Language Model-based Agents
TL;DR Highlight
The first memory-specialized survey paper comprehensively covering how LLM agents store and use memory.
Who Should Read
Backend and AI developers designing or building LLM agents, especially anyone working on conversation-context maintenance or long-term memory implementation. Also useful for teams considering multi-agent systems or personal-assistant development.
Core Mechanics
- Classifies memory sources into 3 types: information within the current task (inside-trial), experience from past tasks (cross-trial), and external knowledge (Wikipedia, APIs, etc.)
- Memory storage forms: Textual (natural language) and Parametric (embedded in parameters). Text is easy to interpret and modify but hits context length limits; parametric is dense but expensive to update.
- 3 stages of memory operation: Writing (store info) → Management (summarize/merge/forget) → Reading (retrieve/use). Generative Agents create high-level memories via reflection; Reflexion uses failure experiences as language-based reinforcement.
- Most current systems rely primarily on textual memory. A variety of retrieval strategies are in use: FAISS vector search, SQL-based search, LSH, etc.
- Systems like Reflexion, ExpeL, Synapse improve performance by using cross-trial info (past failure/success experience) to reduce repeating the same mistakes.
- Memory evaluation proceeds on two tracks: direct (accuracy, F1, response consistency) and indirect (conversation completeness, multi-source QA, long-context task success rate).
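The Writing → Management → Reading pipeline above can be sketched without any LLM machinery. In this toy (every name and the overlap-scoring rule are illustrative, not from the survey), management is plain recency-based forgetting and reading is keyword-overlap retrieval:

```python
from collections import deque

class TextualMemory:
    """Toy Writing -> Management -> Reading pipeline for textual agent memory."""

    def __init__(self, capacity=5):
        # Management here is simple recency-based forgetting:
        # once full, the oldest entry is dropped automatically.
        self.entries = deque(maxlen=capacity)

    def write(self, observation: str):
        # Writing: store a new observation verbatim
        self.entries.append(observation)

    def read(self, query: str, top_k=2):
        # Reading: score stored entries by word overlap with the query
        q = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(q & set(e.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

mem = TextualMemory(capacity=3)
for obs in ["door is locked", "key under the mat", "lamp is on", "cat on the sofa"]:
    mem.write(obs)  # "door is locked" is forgotten once capacity is exceeded

print(mem.read("where is the key to the door"))
# -> ['key under the mat', 'lamp is on']
```

Real systems replace recency-based forgetting with summarization or merging and word overlap with embedding similarity, but the three-stage contract stays the same.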
Evidence
- Voyager (a Minecraft agent) uses cross-trial skill memory to discover meaningfully more items during exploration than the baseline (ablation comparison in the paper)
- Reflexion: AlfWorld task success rate improves significantly over a no-memory baseline (improvements of 20%+ over ReAct reported)
- ExpeL extracts patterns from failed and successful trajectories and applies them to new tasks, with confirmed performance improvements in complex interactive tasks like AlfWorld
- MemGPT borrows the OS virtual-memory concept to maintain context in long conversations, with reported improvements in user engagement (CSIM score)
How to Apply
- When building conversation agents: instead of dumping the entire history into the prompt, store conversation segments as embeddings in a vector DB such as FAISS and retrieve only the top-K most similar to the current question. This keeps context length bounded.
- When an agent fails: instead of blindly retrying, use Reflexion-style cross-trial memory to summarize the failure cause in natural language and include it in the next attempt's prompt, reducing repeats of the same mistake.
- For domain-specific agents (medical, financial): combine external knowledge (Wikipedia, specialized DBs) fetched via dynamic API calls with core domain knowledge fine-tuned into the parameters via LoRA. This provides both recency and expertise.
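The external/parametric split in the last bullet can be mocked in a few lines. Here volatile facts are served from an external store (a plain dict standing in for a live API or specialized DB), and everything else falls through to the fine-tuned model (a placeholder function); all names and values are illustrative:

```python
# Stands in for a live API or specialized DB: holds volatile, fast-changing facts.
EXTERNAL_KB = {
    "latest usd/krw rate": "1350.00 (illustrative placeholder)",
}

def call_finetuned_model(query: str) -> str:
    # Placeholder for a LoRA-fine-tuned model carrying stable domain expertise.
    return f"[model answer for: {query}]"

def answer(query: str) -> str:
    key = query.lower().strip("?")
    if key in EXTERNAL_KB:
        # Recency: volatile facts come from the external store
        return EXTERNAL_KB[key]
    # Expertise: stable knowledge comes from the parametric model
    return call_finetuned_model(query)

print(answer("Latest USD/KRW rate?"))
# -> 1350.00 (illustrative placeholder)
print(answer("what is duration risk"))
# -> [model answer for: what is duration risk]
```

In production the dict lookup becomes a retrieval or API-routing step, but the decision structure (external for freshness, parametric for depth) is the same.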
Code Example
# Reflexion-style cross-trial memory pattern example
from openai import OpenAI

client = OpenAI()

# Memory store that saves failure experiences in natural language
cross_trial_memory = []

def reflect_on_failure(task, failed_trajectory, error_feedback):
    """Analyze failure experiences to extract lessons learned"""
    prompt = f"""
Task: {task}
What I tried: {failed_trajectory}
What went wrong: {error_feedback}
In 2-3 sentences, what should I do differently next time?
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def run_agent_with_memory(task, max_trials=3):
    for trial in range(max_trials):
        # Include past failure experiences in the prompt
        memory_context = ""
        if cross_trial_memory:
            memory_context = "Past experiences:\n" + "\n".join(
                f"- {m}" for m in cross_trial_memory[-3:]  # only the last 3 lessons
            )
        prompt = f"""
{memory_context}
Now solve this task: {task}
"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        result = response.choices[0].message.content
        # On failure, reflect and save the lesson to cross-trial memory
        success = evaluate_result(result)  # supply task-specific evaluation logic
        if not success:
            lesson = reflect_on_failure(task, result, "Task failed")
            cross_trial_memory.append(lesson)
            print(f"Trial {trial + 1} failed. Learned: {lesson}")
        else:
            return result
    return "Max trials reached"
# Retrieved-memory pattern (using FAISS)
import faiss
import numpy as np

class AgentMemory:
    def __init__(self, embedding_dim=1536):
        self.index = faiss.IndexFlatL2(embedding_dim)
        self.memories = []

    def write(self, text, embedding):
        """Save a memory and its embedding"""
        # FAISS expects a 2-D float32 array
        self.index.add(np.asarray([embedding], dtype="float32"))
        self.memories.append(text)

    def read(self, query_embedding, top_k=3):
        """Retrieve the top-k most relevant memories"""
        distances, indices = self.index.search(
            np.asarray([query_embedding], dtype="float32"), top_k
        )
        # FAISS pads with -1 when fewer than top_k entries exist
        return [self.memories[i] for i in indices[0] if 0 <= i < len(self.memories)]
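If you want the same write/read interface without a FAISS dependency, brute-force L2 search in NumPy behaves identically at small scale. The class name and the toy 2-D embeddings below are illustrative, not from the survey:

```python
import numpy as np

class NumpyMemory:
    """Same write/read interface as a FAISS-backed memory, but using
    brute-force L2 search in NumPy (fine for small memory stores)."""

    def __init__(self):
        self.vectors = []
        self.memories = []

    def write(self, text, embedding):
        self.vectors.append(np.asarray(embedding, dtype="float32"))
        self.memories.append(text)

    def read(self, query_embedding, top_k=3):
        q = np.asarray(query_embedding, dtype="float32")
        # Squared L2 distance to every stored vector
        dists = [float(np.sum((v - q) ** 2)) for v in self.vectors]
        order = np.argsort(dists)[:top_k]
        return [self.memories[i] for i in order]

mem = NumpyMemory()
mem.write("user prefers Korean", [1.0, 0.0])
mem.write("user is vegetarian", [0.0, 1.0])
mem.write("user lives in Seoul", [0.9, 0.1])
print(mem.read([1.0, 0.0], top_k=2))
# -> ['user prefers Korean', 'user lives in Seoul']
```

Swapping this for the FAISS version only matters once the store grows beyond a few thousand entries, where an index pays off.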
Original Abstract
Large language model (LLM)-based agents have recently attracted much attention from the research and industry communities. Compared with original LLMs, LLM-based agents are featured in their self-evolving capability, which is the basis for solving real-world problems that need long-term and complex agent-environment interactions. The key component to support agent-environment interactions is the memory of the agents. While previous studies have proposed many promising memory mechanisms, they are scattered in different papers, and there lacks a systematical review to summarize and compare these works from a holistic perspective, failing to abstract common and effective designing patterns for inspiring future studies. To bridge this gap, in this article, we propose a comprehensive survey on the memory mechanism of LLM-based agents. In specific, we first discuss “what is” and “why do we need” the memory in LLM-based agents. Then, we systematically review previous studies on how to design and evaluate the memory module. In addition, we also present many agent applications, where the memory module plays an important role. At last, we analyze the limitations of existing work and show important future directions. To keep up with the latest advances in this field, we create a repository at https://github.com/nuster1128/LLM_Agent_Memory_Survey.