F2LLM-v2: A Multilingual Embedding Model Family Supporting 200+ Languages

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Mar 19, 2026 • Ziyin Zhang, Zihan Liao, Hang Yu +2

TL;DR Highlight

A set of eight open-source embedding models that supports 200+ languages without an English bias and beats Qwen3-Embedding at smaller model sizes.

Who Should Read

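Backend and ML engineers looking for a multilingual embedding model for non-English languages such as Korean, Japanese, or Arabic in a RAG pipeline. Developers weighing the deployment of lightweight embedding models in edge environments.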

Core Mechanics

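Available in 8 sizes from 80M to 14B parameters, so you can pick one for anything from a server (14B) to an edge device (80M).
Supports 282 natural languages plus 40+ programming languages. Unlike earlier models skewed toward English and Chinese, the training set includes large amounts of data for mid- and low-resource languages such as Spanish, Arabic, Persian, and Vietnamese.
Matryoshka Representation Learning (MRL, a technique that keeps performance largely intact when the embedding vector is truncated) lets you choose any embedding dimension from 8 to 4096. Example: the full 896-dimensional embedding of the 330M model performs about the same as the 14B model truncated to 32 dimensions.
Model pruning (cutting unneeded parts out of a large model) combined with knowledge distillation (training a small model to mimic a large one) protects the performance of the small models: up to 4.67 points higher than without distillation.
F2LLM-v2-14B ranks first on 11 of 17 MTEB benchmarks. The 0.6B model beats the same-size Qwen3-Embedding-0.6B on most language-specific benchmarks, including Code, Scandinavian, Indic, and French.
Models, training data, code, and intermediate checkpoints are all released, in contrast to Gemini-Embedding and Qwen3-Embedding, which do not disclose their training methodology.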
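
To make the MRL point concrete, here is a minimal sketch of what "truncating" an embedding means in practice: keep a prefix of the vector and re-normalize it. The helper name `truncate_embedding` and the toy 8-dimensional vector are illustrative assumptions, not part of the F2LLM-v2 release.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and L2-normalize the result."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # nothing to normalize
    return [x / norm for x in prefix]

# Toy 8-dimensional vector standing in for a real model output (assumption).
full = [0.5, -0.5, 0.25, 0.25, 0.1, -0.1, 0.05, 0.05]
small = truncate_embedding(full, 4)
print(len(small))
```

Re-normalizing matters when similarity is computed as a plain dot product: a truncated prefix of a unit vector is generally no longer unit length.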

Evidence

F2LLM-v2-14B ranks first on 11 of 17 MTEB benchmarks, including Polish, Indic, European, Scandinavian, and Code.
With knowledge distillation vs. without: 80M: 58.04 vs 53.37, 330M: 64.55 vs 62.77, 1.7B: 69.13 vs 68.58 (average over 350 tasks).
F2LLM-v2-0.6B vs Qwen3-Embedding-0.6B: ahead overall, with Code (77.41 vs 75.42), Scandinavian (64.32 vs 60.99), Indic (70.11 vs 66.53), and French (68.14 vs 63.01).
F2LLM-v2-330M vs EmbeddingGemma-0.3B: average 64.41 vs 59.55, with large gaps on Code (75.74 vs 68.76) and Scandinavian (61.93 vs 54.39).
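
The distillation numbers above are easier to compare as differences. The short calculation below uses only the scores quoted in this list; it shows where the "up to 4.67 points" figure comes from and that the gain shrinks as the student model grows.

```python
# Average scores quoted above: (with distillation, without distillation).
reported = {
    "80M": (58.04, 53.37),
    "330M": (64.55, 62.77),
    "1.7B": (69.13, 68.58),
}

# Gain from distillation per model size, rounded to two decimals.
gains = {
    size: round(with_kd - without_kd, 2)
    for size, (with_kd, without_kd) in reported.items()
}

for size, gain in gains.items():
    print(f"{size}: +{gain:.2f}")
```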

How to Apply

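When indexing non-English documents (Korean, Japanese, Persian, and so on) in a RAG pipeline, switching from multilingual-e5-large to F2LLM-v2-1.7B or 4B can substantially raise benchmark performance for those languages.
On a memory-constrained edge server, MRL lets you load F2LLM-v2-1.7B and truncate its embeddings to 128 to 256 dimensions, cutting storage and compute cost relative to full-dimension embeddings while keeping reasonable performance. Re-normalize the truncated vectors if you score with a dot product.
For services that need code search, F2LLM-v2-4B or larger is recommended: these models score 80.15 to 80.75 on the Code MTEB benchmark, placing 1st to 6th, and cover 40+ programming languages including Python, Java, and Go.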
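
As a rough back-of-the-envelope check on the storage savings from truncation, the sketch below compares raw index sizes. It assumes float32 storage (4 bytes per value), a hypothetical corpus of one million vectors, and a 2048-dimensional full embedding for the 1.7B model; index overhead and quantization are ignored.

```python
BYTES_PER_FLOAT32 = 4

def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of a dense vector index, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value

num_docs = 1_000_000  # hypothetical corpus size
full_bytes = index_size_bytes(num_docs, 2048)   # full-dimension embeddings
small_bytes = index_size_bytes(num_docs, 128)   # truncated to 128 dimensions

print(f"full:  {full_bytes / 1e9:.3f} GB")
print(f"small: {small_bytes / 1e9:.3f} GB")
print(f"reduction: {full_bytes // small_bytes}x")
```

Under these assumptions the raw vectors shrink from about 8.2 GB to about 0.5 GB, a 16x reduction; dot-product cost per comparison drops by the same factor.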

Code Example

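from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model (HuggingFace: codefuse-ai/F2LLM-v2-1.7B)
model = SentenceTransformer('codefuse-ai/F2LLM-v2-1.7B', trust_remote_code=True)

# Default embeddings (full dimension)
# The same sentence in Korean, English, and Arabic
sentences = [
    '안녕하세요, 오늘 날씨가 좋네요.',
    'Hello, the weather is nice today.',
    'مرحبا، الطقس جميل اليوم.',
]
embeddings = model.encode(sentences)
print(f'Full embedding shape: {embeddings.shape}')  # (3, 2048)

# MRL: truncate to 128 dimensions for lightweight use
embeddings_small = embeddings[:, :128]
# Re-normalize so dot products on the truncated vectors are cosine similarities
embeddings_small = embeddings_small / np.linalg.norm(embeddings_small, axis=1, keepdims=True)
print(f'Truncated embedding shape: {embeddings_small.shape}')  # (3, 128)

# Apply a retrieval instruction (queries only)
# Korean query: "Find a greeting about the weather"
query = '날씨에 관한 인사말을 찾아줘'
query_embedding = model.encode(
    query,
    prompt='Instruct: Given a query, retrieve relevant passages\nQuery: '
)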
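
Once query and document embeddings exist, retrieval reduces to ranking documents by similarity to the query. The sketch below illustrates that step with cosine similarity in plain Python; the helper names `cosine` and `rank` and the toy 3-dimensional vectors are assumptions standing in for real model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for real embeddings (assumption).
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.5, 0.9, 0.0]
order = rank(query_vec, docs)
print(order)
```

In a real pipeline the vectors would come from `model.encode`, and a vector database would typically replace the linear scan.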

Terminology

MTEB: A standard benchmark that evaluates text embedding models on more than 500 tasks. It serves as the de facto scoreboard for embedding models among developers.

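Matryoshka Representation Learning (MRL): A training technique that nests smaller vectors inside a larger one, like Russian matryoshka dolls. Even with a 4096-dimensional embedding, keeping only the first 128 dimensions still yields reasonable performance.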

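Knowledge Distillation: Training a small model (the student) to mimic the outputs of a large model (the teacher). The student typically reaches good performance much faster than it would by training on its own.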

Pruning: Slimming a model down by cutting out less important neurons or layers. Like losing weight, but for model size.

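Contrastive Learning: Training so that similar text pairs sit close together in vector space and dissimilar pairs sit far apart. The core method for training embedding models.

Hard Negative: An example that looks like a correct answer at first glance but is actually wrong. Including these in training teaches the model to make finer distinctions.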
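
To illustrate why hard negatives carry more training signal, here is a minimal sketch of an InfoNCE-style contrastive loss, a common objective for embedding models. This summary does not state the exact loss used for F2LLM-v2, so the formulation, the `temperature` value, and the toy unit vectors are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy unit vectors (assumption): the hard negative is close to the positive.
query = [1.0, 0.0]
positive = [0.8, 0.6]
easy_negative = [-1.0, 0.0]
hard_negative = [0.6, 0.8]

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
print(loss_easy < loss_hard)
```

The easy negative contributes almost nothing to the loss, while the hard negative produces a noticeably larger loss and therefore a larger gradient.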

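EOS Token: A special token marking the end of a sequence. In this paper, the vector produced at the EOS token position of the decoder LLM is used as the embedding of the whole sentence.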
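
EOS-based pooling can be sketched as picking the hidden state at the last non-padding position. The function below is an illustration only: it assumes right-padded input where the EOS token is the final real token, and the name `last_token_pool` and the toy values are hypothetical.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state at the last position where the mask is 1."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]

# Toy per-token hidden states for a 4-position sequence (assumption).
# Position 2 holds the EOS token; position 3 is padding.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.9, 0.8], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

sentence_embedding = last_token_pool(hidden_states, attention_mask)
print(sentence_embedding)
```

Because a decoder LLM attends causally, only the final token has seen the whole input, which is why its hidden state is a natural choice for the sentence vector.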

Original Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.