MMRC: A Large-Scale Benchmark for Evaluating Multimodal Large Language Models in Real-World Conversation

MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation

Feb 17, 2025 • Haochen Xue, Feilong Tang, Ming Hu +13

TL;DR Highlight

A benchmark that measures how much the memory of 20 multimodal models, including GPT-4o, degrades over long conversations, together with a simple mitigation.

Who Should Read

Engineers building multimodal chatbots or conversational AI services, especially developers who want to improve a model's memory and reasoning over long conversations.

Core Mechanics

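Releases the MMRC benchmark: 5,120 real user conversations and 28,720 manually labeled questions. Unlike earlier benchmarks, it uses genuine user conversations rather than GPT-generated ones.
Six evaluated abilities: information extraction (IE), multi-turn reasoning (CR), information update (IU), image management (IM), memory recall (MR), and answer refusal (AR).
GPT-4o scores 4.08, well below the human level of 4.81, and open-source models average 2.55, so performance drops sharply in real conversations.
Four common failure patterns: (1) loss of memory for the middle of a conversation, (2) failure to reflect updated information, (3) errors propagating to later turns, and (4) forcing an answer instead of refusing when the information is unknown.
Larger models score higher overall but, paradoxically, are worse at refusing to answer when they do not know.
NOTE-TAKING strategy: key information is recorded as JSON during the conversation and consulted when responding. For LLaVA-1.5-7B, this raises the memory recall score from 0.22 to 2.36.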

Evidence

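GPT-4o's overall MMRC score is 4.08 versus 4.81 for humans. Open-source models average 2.55, roughly half the human score.
With NOTE-TAKING, LLaVA-1.5-7B's memory recall (MR) score rises from 0.22 to 2.36 (+2.14) and information extraction (IE) from 0.91 to 2.57 (+1.66).
The IE score is 4.89 for conversations of 4 to 6 turns but falls to 1.07 at 20 to 22 turns: memory degrades sharply, not linearly, as conversations get longer.
Text-based conversations score 26.3% higher overall than image-based ones, and 34.6% higher on the memory-related abilities (IE, IM, MR).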
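The gaps quoted above can be re-derived from the reported scores. The Python sketch below only restates that arithmetic; the variable names are illustrative and do not come from the paper.

```python
# Scores as quoted in this summary.
human, gpt4o, open_source_avg = 4.81, 4.08, 2.55

# NOTE-TAKING results for LLaVA-1.5-7B as (before, after) pairs.
note_taking = {"MR": (0.22, 2.36), "IE": (0.91, 2.57)}

# Absolute gain per ability, rounded to two decimals.
gains = {
    ability: round(after - before, 2)
    for ability, (before, after) in note_taking.items()
}

# Open-source average expressed as a fraction of the human score.
open_source_ratio = round(open_source_avg / human, 2)

# Relative drop in IE between 4-6 turn and 20-22 turn conversations.
ie_short, ie_long = 4.89, 1.07
ie_drop = round((ie_short - ie_long) / ie_short, 2)

print(gains, open_source_ratio, ie_drop)
```

The IE score at 20 to 22 turns is less than a quarter of its value at 4 to 6 turns, which is why the summary describes the decline as sharp.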

How to Apply

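Apply the NOTE-TAKING pattern when building a long-conversation chatbot: each time a user message arrives, have a separate LLM (or the same model) extract and update the key information as JSON, then include that note in the context when generating the response. This mitigates the memory problem.
For multimodal chatbots, summarize each image as text and record the summary in the note as well. This may help the image management (IM) score, since text-based conversations score 26.3% higher overall than image-based ones.
When the model forces an answer in situations where it should say it does not know, add an explicit instruction to the system prompt, such as "Always refuse to answer about anything that was not mentioned in the conversation."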
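The three suggestions above can be combined in a small amount of glue code. The following is a minimal sketch, not the paper's implementation: `merge_note`, `add_image_summary`, `build_system_prompt`, the `"images"` key, and the wording of `REFUSAL_RULE` are all hypothetical names and choices made for illustration.

```python
import json

# Hypothetical wording of the refusal instruction.
REFUSAL_RULE = (
    "If the user asks about something that was not mentioned in the "
    "conversation, say that you do not know instead of guessing."
)

def merge_note(note: dict, update: dict) -> dict:
    """Return a copy of `note` with `update` applied; newer facts overwrite older ones."""
    merged = dict(note)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling facts in a nested section are preserved.
            merged[key] = merge_note(merged[key], value)
        else:
            # A new fact replaces the outdated one.
            merged[key] = value
    return merged

def add_image_summary(note: dict, image_id: str, summary: str) -> dict:
    """Record a text summary of an image under a hypothetical "images" section."""
    return merge_note(note, {"images": {image_id: summary}})

def build_system_prompt(note: dict) -> str:
    """Combine the refusal rule with the current note for the answering model."""
    note_text = json.dumps(note, ensure_ascii=False, indent=2)
    return f"{REFUSAL_RULE}\n\n[Conversation Notes]\n{note_text}"

# Usage: the user changes a preference, then shares an image.
note = merge_note({}, {"preferences": {"food_preference": "Italian cuisine"}})
note = merge_note(note, {"preferences": {"food_preference": "Japanese cuisine"}})
note = add_image_summary(note, "img_1", "A photo of a sushi platter")
prompt = build_system_prompt(note)
```

Merging in code rather than asking the model to rewrite the whole note is a design choice of this sketch; it keeps the overwrite behavior deterministic but requires the note-taker's output to be valid JSON.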

Code Example

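# Example implementation of the NOTE-TAKING strategy.
# Assumption: the openai Python package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()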

SYSTEM_PROMPT_NOTETAKER = """
You are an intelligent chatbot designed to accurately record information from conversations in JSON format.
- Focus on the user's preferences and the facts within the conversation.
- If the user presents a new fact, it should overwrite the outdated fact.
- REMEMBER NOT TO RESPOND TO THE USER'S INPUT!
"""

USER_PROMPT_NOTETAKER = """
Extract key information from the input, including the user's preferences, events, and facts.
If you detect a change in information, overwrite the outdated content.

Here is the user input: {user_input}

Provide the Note in JSON format, for example:
{{
  "user_info": {{
    "health_conditions": ["High blood pressure", "Dairy allergy"]
  }},
  "preferences": {{
    "food_preference": "Italian cuisine"
  }},
  "purpose": "Seeking meal suggestions."
}}
"""

def take_note(client, user_input, current_note):
    """Extract key information from the user message and return the updated note."""
    prompt = USER_PROMPT_NOTETAKER.format(user_input=user_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT_NOTETAKER},
            {"role": "user", "content": f"Current note: {current_note}\n\n{prompt}"}
        ]
    )
    return response.choices[0].message.content

def chat_with_note(client, history, question, note):
    """Generate a response with the note included as context."""
    augmented_history = history + [
        {"role": "system", "content": f"[Conversation Notes]\n{note}"}
    ]
    augmented_history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=augmented_history
    )
    return response.choices[0].message.content

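# Usage example
# `conversation_turns` is a hypothetical list of user messages; replace it with real input.
conversation_turns = [
    "I have a dairy allergy and I like Italian food.",
    "Can you suggest a dinner for tonight?",
]
note = "{}"  # kept as a JSON string, matching what take_note returns
history = []
for user_message in conversation_turns:
    # 1. Update the note
    note = take_note(client, user_message, note)
    # 2. Generate the response
    response = chat_with_note(client, history, user_message, note)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": response})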
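The example above passes the note-taker's reply along as raw text. A model may wrap the JSON in extra prose, so it can help to validate the reply before reusing it. The helper below is a minimal sketch under that assumption; `parse_note` is a hypothetical name and is not part of the paper or of any library.

```python
import json
import re

def parse_note(raw: str, previous: dict) -> dict:
    """Parse the note-taker's reply as a JSON object; fall back to `previous` on failure."""
    # Take the span from the first "{" to the last "}" to drop surrounding text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return previous
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        # Keep the last good note rather than losing the memory entirely.
        return previous
    return parsed if isinstance(parsed, dict) else previous
```

Falling back to the previous note means one malformed reply does not erase what was recorded in earlier turns.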

Terminology

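MLLM: A large AI model that understands images as well as text. Models such as GPT-4o and Gemini, which can hold a conversation about a photo you show them.

Benchmark: A set of test questions for comparing AI models objectively. As with a standardized exam, every model answers the same questions and the scores are compared.

Multi-turn Reasoning: The ability to combine information from several earlier turns to answer. Handling requests like "put together the A I mentioned earlier and the B I just mentioned."

Information Update: The ability to notice when the user corrects something said earlier and to update accordingly. When the user says "sorry, it's actually B," the model drops A and remembers B.

Answer Refusal: The ability to say honestly "I don't know" when asked about something that was not mentioned in the conversation, rather than inventing information.

NOTE-TAKING: A strategy of writing down important information as the conversation proceeds. Like a person taking notes in a long meeting, the AI organizes the conversation as JSON to compensate for its memory limits.

Open-ended Conversation: A conversation that proceeds freely as the user wishes, with no predefined scenario. Instead of choosing from a fixed menu, the user can say anything.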

Related Resources

MMRC GitHub Repository

Original Abstract

Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.