Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
TL;DR Highlight
A benchmark that systematically measures how well MLLM agents maintain, reason over, and update memory across dozens of multi-session conversations mixing images and text.
Who Should Read
ML engineers and researchers looking to add long-term memory capabilities to multimodal AI agents or chatbots — especially developers seeking to understand the limitations of RAG-based memory systems and explore directions for improvement.
Core Mechanics
- Converting images to text captions for storage is highly lossy — visual patterns must be preserved directly in memory for Test-Time Learning (the ability to adapt by seeing new examples at inference time) to work properly
- Simple multimodal RAG (MuRAG) outperforms more complex memory system architectures (NGM, AUGUSTUS) in overall performance — how information is preserved matters more than architectural complexity
- Storing all images as-is, the Full Memory (Multimodal) setting, yields lower performance than text-only approaches — too many visual tokens introduce noise and crowd out important text
- Increasing retrieval K boosts Recall but causes Precision to drop sharply, with actual QA performance saturating or declining around K=10 — 'retrieving accurately' matters more than 'retrieving more'
- Conflict Detection and Knowledge Resolution are broken across all memory systems — even switching to a stronger backbone model yields minimal improvement, pointing to the need for fundamental design changes
- Using a stronger backbone (e.g., Gemini-2.5-Flash-Lite) improves extraction performance, but reasoning and knowledge management limitations remain largely unchanged — the problem lies in the memory architecture itself, not model size
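The first and third points above can be sketched with a toy memory store that keeps both a caption vector and a raw-image vector per turn and retrieves by the best match across both, so visual patterns are not flattened into text. Everything here — the `toy_embed` bigram-hashing encoder and the `MultimodalMemory` class — is an illustrative stand-in for a real multimodal encoder such as GME-Qwen2-VL, not the benchmark's implementation.

```python
import math

def toy_embed(text):
    # Deterministic toy embedding: hash character bigrams into 32 dims.
    # Stands in for a real multimodal encoder; a string "image signature"
    # stands in for pixels below.
    vec = [0.0] * 32
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % 32] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

class MultimodalMemory:
    """Stores each turn as BOTH a caption vector and an image vector,
    so visual detail survives instead of being lost to captioning."""
    def __init__(self):
        self.entries = []  # list of (payload, [vectors])

    def observe(self, caption, image_signature=None):
        vectors = [toy_embed(caption)]
        if image_signature is not None:
            # A caption-only store would drop this second vector —
            # exactly the lossy step the benchmark penalizes.
            vectors.append(toy_embed(image_signature))
        self.entries.append((caption, vectors))

    def retrieve(self, query, k=3):
        q = toy_embed(query)
        scored = [(max(cosine(q, v) for v in vecs), payload)
                  for payload, vecs in self.entries]
        scored.sort(reverse=True)
        return [payload for _, payload in scored[:k]]
```

A query phrased in visual terms ("corgi dog photo") can then hit a memory whose caption never mentioned the breed, because the image vector still carries it.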
Evidence
- MuRAG achieves +11.85% overall F1, +12.29% on visually-grounded retrieval (VS), and +29.06% on Test-Time Learning (TTL) compared to the best-performing text-only memory baseline
- Full Memory (Multimodal) shows -8.08% overall F1 vs. Full Memory (Text) and -51.85% vs. MuRAG — blindly stacking images is counterproductive
- Increasing K from 10 to 20 raises MuRAG Recall@K from 86% to 92%, but overall QA F1 slightly drops or stagnates
- Across all 13 memory systems, the highest Conflict Detection F1 is approximately 0.37 — barely above a random baseline
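The Recall@K vs. Precision@K trade-off behind the K=10-to-20 result can be reproduced in a few lines. The ranked list and relevant set below are made-up toy data to show the mechanics, not benchmark numbers.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@K and Recall@K for a ranked retrieval list."""
    top = retrieved[:k]
    hits = sum(1 for item in top if item in relevant)
    return hits / k, hits / len(relevant)

relevant = {"m1", "m2", "m3"}  # the 3 memories actually needed
retrieved = ["m1", "x1", "m2", "x2", "x3", "m3", "x4", "x5", "x6", "x7"]

p3, r3 = precision_recall_at_k(retrieved, relevant, 3)  # P≈0.67, R≈0.67
p6, r6 = precision_recall_at_k(retrieved, relevant, 6)  # P=0.50, R=1.00
```

Doubling K captures the last relevant memory (recall hits 1.0) but dilutes the context with distractors (precision falls) — which is why downstream QA saturates instead of improving.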
How to Apply
- When building RAG-based memory, don't store images as captions alone — also index the original images as embedding vectors. This matters especially in scenarios where users reference 'that photo you showed me last time'
- Rather than blindly increasing retrieval K, fix it around K=10 and first try adding a Precision-focused reranking layer on top
- When adding information-correction functionality to a chatbot (e.g., 'What I said earlier was wrong, actually...'), implement explicit logic to delete and update existing memory entries rather than simply appending — every current system fails at this
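The last point — replacing rather than appending on correction — can be sketched as a keyed fact store. `UpdatableMemory` and its `(subject, attribute)` schema are hypothetical illustrations, not part of any system evaluated in the benchmark.

```python
class UpdatableMemory:
    """Correction-aware store: a new value for an existing key overwrites
    the old fact instead of appending a contradictory duplicate."""
    def __init__(self):
        self.facts = {}  # (subject, attribute) -> current value

    def assert_fact(self, subject, attribute, value):
        key = (subject, attribute)
        previous = self.facts.get(key)
        self.facts[key] = value  # update in place, never append
        return previous          # surfaced for audit logging

    def lookup(self, subject, attribute):
        return self.facts.get((subject, attribute))

mem = UpdatableMemory()
mem.assert_fact("user", "dog_breed", "beagle")
# User later says: "What I said earlier was wrong, actually it's a corgi"
superseded = mem.assert_fact("user", "dog_breed", "corgi")
```

An append-only store would now hold both "beagle" and "corgi" and leave the retriever to guess; keying facts makes the supersede explicit and leaves an audit trail of what was overwritten.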
Code Example
```python
# Mem-Gallery benchmark evaluation environment setup (MemEngine-based)
# https://github.com/nuster1128/MemEngine
# NOTE: MemoryAgent / MuRAGConfig and their parameters are illustrative;
# consult the MemEngine repository for the actual API.
from memengine import MemoryAgent, MuRAGConfig

# Initialize the multimodal memory agent
config = MuRAGConfig(
    embedding_model="GME-Qwen2-VL-2B-Instruct",
    retrieval_k=10,         # K=10 sits near the precision/recall sweet spot
    store_raw_images=True,  # index image embeddings alongside captions, not captions alone
)
agent = MemoryAgent(backbone="Qwen2.5-VL-7B-Instruct", memory_config=config)

# Stream the conversation session by session
for session in multi_session_conversation:
    for turn in session.turns:
        agent.observe(text=turn.text, image=turn.image)  # multimodal input

# Query at evaluation time
result = agent.query(
    question="What breed was the dog in the photo you showed me last time?",
    query_image=current_image,  # include the visual query as well
)
print(result)
```
Original Abstract
Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.