Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
TL;DR Highlight
A benchmark that systematically measures how well MLLM agents maintain, reason over, and update memory across dozens of multi-session conversations mixing images and text.
Who Should Read
ML engineers and researchers looking to add long-term memory capabilities to multimodal AI agents or chatbots — especially developers seeking to understand the limitations of RAG-based memory systems and explore directions for improvement.
Core Mechanics
- Converting images to text captions for storage is highly lossy — visual patterns must be preserved directly in memory for Test-Time Learning (the ability to adapt by seeing new examples at inference time) to work properly
- Simple multimodal RAG (MuRAG) outperforms more complex memory system architectures (NGM, AUGUSTUS) in overall performance — how information is preserved matters more than architectural complexity
- Storing all images as-is, the Full Memory (Multimodal) setting, yields lower performance than text-only approaches — too many visual tokens introduce noise and crowd out important text
- Increasing retrieval K boosts Recall but causes Precision to drop sharply, with actual QA performance saturating or declining around K=10 — 'retrieving accurately' matters more than 'retrieving more'
- Conflict Detection and Knowledge Resolution are broken across all memory systems — even switching to a stronger backbone model yields minimal improvement, pointing to the need for fundamental design changes
- Using a stronger backbone (e.g., Gemini-2.5-Flash-Lite) improves extraction performance, but reasoning and knowledge management limitations remain largely unchanged — the problem lies in the memory architecture itself, not model size
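The first and third points above can be sketched with a toy memory store that keeps both a caption vector and a raw-image vector per turn and retrieves by the best match across both, so visual patterns are not flattened into text. Everything here — the `toy_embed` bigram-hashing encoder and the `MultimodalMemory` class — is an illustrative stand-in for a real multimodal encoder such as GME-Qwen2-VL, not the benchmark's implementation.

```python
import math

def toy_embed(text):
    # Deterministic toy embedding: hash character bigrams into 32 dims.
    # Stands in for a real multimodal encoder; a string "image signature"
    # stands in for pixels below.
    vec = [0.0] * 32
    for a, b in zip(text, text[1:]):
        vec[(ord(a) * 31 + ord(b)) % 32] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

class MultimodalMemory:
    """Stores each turn as BOTH a caption vector and an image vector,
    so visual detail survives instead of being lost to captioning."""
    def __init__(self):
        self.entries = []  # list of (payload, [vectors])

    def observe(self, caption, image_signature=None):
        vectors = [toy_embed(caption)]
        if image_signature is not None:
            # A caption-only store would drop this second vector —
            # exactly the lossy step the benchmark penalizes.
            vectors.append(toy_embed(image_signature))
        self.entries.append((caption, vectors))

    def retrieve(self, query, k=3):
        q = toy_embed(query)
        scored = [(max(cosine(q, v) for v in vecs), payload)
                  for payload, vecs in self.entries]
        scored.sort(reverse=True)
        return [payload for _, payload in scored[:k]]
```

A query phrased in visual terms ("corgi dog photo") can then hit a memory whose caption never mentioned the breed, because the image vector still carries it.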
Evidence
- MuRAG achieves +11.85% overall F1, +12.29% on visually-grounded retrieval (VS), and +29.06% on Test-Time Learning (TTL) compared to the best-performing text-only memory baseline
- Full Memory (Multimodal) shows -8.08% overall F1 vs. Full Memory (Text) and -51.85% vs. MuRAG — blindly stacking images is counterproductive
- Increasing K from 10 to 20 raises MuRAG Recall@K from 86% to 92%, but overall QA F1 slightly drops or stagnates
- Across all 13 memory systems, the highest Conflict Detection F1 is approximately 0.37 — barely above a random baseline
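The Recall@K vs. Precision@K trade-off behind the K=10-to-20 result can be reproduced in a few lines. The ranked list and relevant set below are made-up toy data to show the mechanics, not benchmark numbers.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@K and Recall@K for a ranked retrieval list."""
    top = retrieved[:k]
    hits = sum(1 for item in top if item in relevant)
    return hits / k, hits / len(relevant)

relevant = {"m1", "m2", "m3"}  # the 3 memories actually needed
retrieved = ["m1", "x1", "m2", "x2", "x3", "m3", "x4", "x5", "x6", "x7"]

p3, r3 = precision_recall_at_k(retrieved, relevant, 3)  # P≈0.67, R≈0.67
p6, r6 = precision_recall_at_k(retrieved, relevant, 6)  # P=0.50, R=1.00
```

Doubling K captures the last relevant memory (recall hits 1.0) but dilutes the context with distractors (precision falls) — which is why downstream QA saturates instead of improving.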
How to Apply
- When building RAG-based memory, don't store images as captions alone — also index the original images as embedding vectors. This matters especially in scenarios where users reference 'that photo you showed me last time'
- Rather than blindly increasing retrieval K, fix it around K=10 and first try adding a Precision-focused reranking layer on top
- When adding information-correction functionality to a chatbot (e.g., 'What I said earlier was wrong, actually...'), implement explicit logic to delete and update existing memory entries rather than simply appending — every current system fails at this
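The last point — replacing rather than appending on correction — can be sketched as a keyed fact store. `UpdatableMemory` and its `(subject, attribute)` schema are hypothetical illustrations, not part of any system evaluated in the benchmark.

```python
class UpdatableMemory:
    """Correction-aware store: a new value for an existing key overwrites
    the old fact instead of appending a contradictory duplicate."""
    def __init__(self):
        self.facts = {}  # (subject, attribute) -> current value

    def assert_fact(self, subject, attribute, value):
        key = (subject, attribute)
        previous = self.facts.get(key)
        self.facts[key] = value  # update in place, never append
        return previous          # surfaced for audit logging

    def lookup(self, subject, attribute):
        return self.facts.get((subject, attribute))

mem = UpdatableMemory()
mem.assert_fact("user", "dog_breed", "beagle")
# User later says: "What I said earlier was wrong, actually it's a corgi"
superseded = mem.assert_fact("user", "dog_breed", "corgi")
```

An append-only store would now hold both "beagle" and "corgi" and leave the retriever to guess; keying facts makes the supersede explicit and leaves an audit trail of what was overwritten.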
Code Example
```python
# Mem-Gallery benchmark evaluation environment setup (MemEngine-based)
# https://github.com/nuster1128/MemEngine
# NOTE: MemoryAgent / MuRAGConfig and their parameters are illustrative;
# consult the MemEngine repository for the actual API.
from memengine import MemoryAgent, MuRAGConfig

# Initialize the multimodal memory agent
config = MuRAGConfig(
    embedding_model="GME-Qwen2-VL-2B-Instruct",
    retrieval_k=10,         # K=10 sits near the precision/recall sweet spot
    store_raw_images=True,  # index image embeddings alongside captions, not captions alone
)
agent = MemoryAgent(backbone="Qwen2.5-VL-7B-Instruct", memory_config=config)

# Stream the conversation session by session
for session in multi_session_conversation:
    for turn in session.turns:
        agent.observe(text=turn.text, image=turn.image)  # multimodal input

# Query at evaluation time
result = agent.query(
    question="What breed was the dog in the photo you showed me last time?",
    query_image=current_image,  # include the visual query as well
)
print(result)
```
Original Abstract
Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.