MMTEB: Massive Multilingual Text Embedding Benchmark

Feb 19, 2025 • Kenneth C. Enevoldsen, Isaac Chung, Imene Kerboua +83

TL;DR Highlight

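MMTEB is the largest embedding evaluation benchmark to date, covering more than 500 tasks across 250+ languages, and its headline finding is that a 560M-parameter model outperforms 7B-parameter models.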

Who Should Read

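ML engineers choosing an embedding model for a RAG pipeline or semantic search, and backend developers building multilingual services, especially ones that include Korean, Indic languages, or low-resource European languages.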

Core Mechanics

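Expands the original MTEB's 58 tasks to more than 500 and covers 250+ languages, making it the largest evaluation suite for embedding models to date
The best-performing publicly available model is not a 7B LLM but the 560M-parameter multilingual-e5-large-instruct: parameter count does not equal multilingual performance
Mistral-based 7B models such as GritLM-7B and e5-mistral-7b-instruct are strong in English but fall behind smaller XLM-R-based models on low-resource languages
Instruction-tuned models consistently outperform their non-instruction-tuned counterparts across all task categories, including bitext mining and clustering
A task downsampling method keeps model rankings largely unchanged while using only 2% of the original documents, which sharply reduces evaluation cost
A methodology for compressing a benchmark by removing redundant tasks based on inter-task correlation, which can also be used to build custom benchmarks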
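
The idea of dropping redundant tasks via inter-task correlation can be illustrated with a small sketch. This is a simplified greedy filter, not the paper's exact procedure: the function names, the 0.95 threshold, and the scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def prune_redundant_tasks(scores, threshold=0.95):
    """Greedily keep a task only if it is not highly correlated with a kept task.

    `scores` maps task name -> list of scores, one per model, in a fixed model order.
    """
    kept = []
    for task, values in scores.items():
        if all(abs(pearson(values, scores[k])) < threshold for k in kept):
            kept.append(task)
    return kept


# Hypothetical scores for four models on four tasks (not taken from the paper).
scores = {
    "RetrievalA": [0.61, 0.55, 0.48, 0.40],
    "RetrievalB": [0.62, 0.56, 0.49, 0.41],  # near-duplicate of RetrievalA
    "ClusteringA": [0.30, 0.45, 0.28, 0.50],
    "BitextA": [0.90, 0.88, 0.70, 0.75],
}
kept_tasks = prune_redundant_tasks(scores)
print(kept_tasks)
```

RetrievalB is dropped because its scores move in lockstep with RetrievalA, so it adds no new information about how the models rank.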

Evidence

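MTEB(eng, v2) shrinks from 56 to 41 tasks yet keeps a Spearman correlation of 0.90 (p<0.0001) with the original
Clustering evaluation is 16.11x faster on average, and model rankings are preserved with an average Spearman correlation of 0.96
A 7B model completes the full benchmark evaluation in 3.11 hours on an H100 GPU (compared with about two days on an A100 for the original)
multilingual-e5-large-instruct averages 63.2 across the 132 tasks of MTEB(Multilingual), ahead of GritLM-7B (60.9) and e5-mistral-7b-instruct (60.3)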
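
Ranking preservation of this kind is typically checked by comparing model rankings on the full and reduced benchmarks. The following is a minimal sketch of that check, assuming no tied scores; the helper names and the score values are hypothetical and not taken from the paper.

```python
def ranks(values):
    """Rank values from 1 (highest) to n (lowest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical mean scores of five models on a full and a reduced benchmark.
full_benchmark = [0.70, 0.65, 0.60, 0.55, 0.50]
reduced_benchmark = [0.68, 0.59, 0.61, 0.54, 0.49]  # second and third models swap

rho = spearman(full_benchmark, reduced_benchmark)
print(round(rho, 2))
```

A single swap between adjacent models out of five already lowers the coefficient to 0.9, which gives a feel for what values such as 0.90 or 0.96 mean in practice.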

How to Apply

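When choosing an embedding model for a RAG pipeline, do not rely on English MTEB scores alone; check the language-specific sub-benchmarks (MTEB(Multilingual), MTEB(Europe), etc.) on the HuggingFace MTEB leaderboard
For services that include low-resource languages (Korean, Hindi, etc.), weigh whether the model is XLM-R-based and how much multilingual pretraining data it saw ahead of parameter count; multilingual-e5-large-instruct is a practical default
When building an evaluation benchmark for your own domain, the task_selection functionality in the mteb package can select a minimal task set based on inter-task correlation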

Code Example

import mteb
from mteb.task_selection import results_to_dataframe

# Select only tasks for specific languages/domains, then load existing results
tasks = mteb.get_tasks(
    task_types=["Retrieval"],
    languages=["kor", "eng"],  # Korean + English
    domains=["Legal"]          # legal domain; mteb domain labels are capitalized
)

model_names = [
    "intfloat/multilingual-e5-large-instruct",
    "intfloat/multilingual-e5-large",
    "intfloat/multilingual-e5-base",
]
models = [mteb.get_model_meta(name) for name in model_names]
results = mteb.load_results(models=models, tasks=tasks)
df = results_to_dataframe(results)
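# results_to_dataframe is assumed to return a wide table (one row per model,
# one column per task), so rank models by their mean score across tasks
print(df.mean(axis=1).sort_values(ascending=False))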

Terminology

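Text embedding: A technique that converts sentences or words into numeric vectors. Training places 'apple' and 'fruit' close together in vector space, which makes meaning-based search possible.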

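MTEB: A benchmark that evaluates text embedding models on a variety of tasks such as classification, retrieval, and clustering. It serves as the standard report card for embedding models.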

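Instruction tuning: A training technique that feeds the model an instruction, such as 'Can this document answer this question?', along with the text. It improves contextual understanding compared with plain embeddings.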
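
At inference time, instruction-tuned embedding models usually expect the instruction to accompany the query in a fixed template. The sketch below assumes the "Instruct: ... / Query: ..." template described on the multilingual-e5-large-instruct model card; the helper name and the task description are illustrative, and other models may use a different format.

```python
def build_instructed_query(task_description: str, query: str) -> str:
    """Prepend a task instruction to a query (template assumed from the model card)."""
    return f"Instruct: {task_description}\nQuery: {query}"


# Hypothetical retrieval instruction; documents are typically embedded without a prefix.
task = "Given a web search query, retrieve relevant passages that answer the query"
formatted = build_instructed_query(task, "how much protein should a female eat")
print(formatted)
```

Only the query side carries the instruction here; the resulting string is what would be passed to the encoder in place of the raw query.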

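XLM-R: A multilingual pretrained model from Meta trained on 100 languages at once. It is built on a BERT-style architecture and is particularly strong on low-resource languages.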

Bitext mining: The task of finding pairs of texts in different languages that have the same meaning. It is used to collect translation data automatically.
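
With embeddings in hand, bitext mining reduces to nearest-neighbor matching across languages. The following toy sketch uses cosine similarity; the three-dimensional vectors are hypothetical stand-ins for real model outputs, and real systems often add margin-based scoring on top of this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_bitext(source, target):
    """Pair each source sentence with its most similar target sentence."""
    pairs = []
    for src_text, src_vec in source.items():
        best = max(target, key=lambda tgt_text: cosine(src_vec, target[tgt_text]))
        pairs.append((src_text, best))
    return pairs


# Hypothetical embeddings for English and French sentences.
english = {"I like apples": [0.9, 0.1, 0.0], "It is raining": [0.0, 0.2, 0.9]}
french = {"Il pleut": [0.1, 0.1, 0.8], "J'aime les pommes": [0.8, 0.2, 0.1]}

pairs = mine_bitext(english, french)
print(pairs)
```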

Spearman correlation coefficient: A measure of how similar two rankings are. 1.0 means identical ordering; 0 means no relationship.

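Borda count: A method that sums the preference rankings of multiple voters (here, tasks) to produce a final ranking. It is more robust than a simple score average for comparing model rankings.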
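
A Borda count over tasks can be sketched in a few lines. This is a minimal version that assumes every task scores every model and that there are no ties; the model names and scores are hypothetical.

```python
def borda_count(task_scores):
    """Aggregate per-task rankings into one ranking.

    `task_scores` maps task -> {model: score}. In each task a model earns
    (n_models - 1) points for first place down to 0 for last place.
    """
    totals = {}
    for scores in task_scores.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        n = len(ordered)
        for position, model in enumerate(ordered):
            totals[model] = totals.get(model, 0) + (n - 1 - position)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three models on three tasks.
task_scores = {
    "task_1": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60},
    "task_2": {"model_a": 0.50, "model_b": 0.65, "model_c": 0.40},
    "task_3": {"model_a": 0.90, "model_b": 0.30, "model_c": 0.35},
}
ranking = borda_count(task_scores)
print(ranking)
```

Because only the per-task ordering matters, a single task with an unusually wide score range (task_3 here) cannot dominate the final ranking the way it would in a plain average.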

Original Abstract

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.