BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
TL;DR Highlight
Which model to use for zero-shot text classification without labeled data — direct comparison of 38 models across 22 datasets.
Who Should Read
ML engineers and backend developers building text classification pipelines without labeling costs — especially those who need zero-shot sentiment analysis, intent detection, or topic classification.
Core Mechanics
- Qwen3-Reranker-8B takes #1 overall with macro F1 0.72 — F1 +12 points, accuracy +14 points above the previous best NLI cross-encoder
- Embedding models have the best speed-vs-accuracy trade-off — gte-large-en-v1.5 (F1 0.62) beats all NLI cross-encoders while being faster
- NLI cross-encoders (BART-MNLI-style architectures) hit a performance ceiling regardless of model size — best is deberta-v3-large-nli-triplet at F1 0.60
- LLMs become competitive at 4B parameters and above — Qwen3-4B at F1 0.65, Mistral-Nemo-12B at 0.67, but still 5 points below Qwen3-Reranker-8B
- Embedding model scaling has almost no effect — Qwen3-Embedding-8B (F1 0.59) vs 0.6B (F1 0.58), negligible difference
- Emotion classification is the hardest task — all model families score F1 0.25-0.5, far lower than topic/sentiment tasks
Evidence
- Qwen3-Reranker-8B: macro F1 0.72, accuracy 0.76 — #1 of 38 models, +12 F1 over NLI best deberta-v3-large-nli-triplet (F1 0.60)
- gte-large-en-v1.5: macro F1 0.62, beats all NLI cross-encoders while inference speed is much faster (Pareto-efficient)
- BTZSC vs MTEB ranking Kendall τ = 0.69 (p < 10^-8) — strong agreement between the two benchmark model rankings, confirming result reliability
- Qwen3-Reranker-0.6B (F1 0.61) alone exceeds all NLI cross-encoders by F1 — even small rerankers are stronger than the old approach
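The Kendall τ agreement check above is straightforward to reproduce with `scipy.stats.kendalltau`; the rankings below are illustrative toy data, not the actual BTZSC/MTEB rankings over 38 models.

```python
# Rank agreement between two benchmarks, illustrated with toy rankings
# (not the real BTZSC/MTEB data, which covers 38 models).
from scipy.stats import kendalltau

# Hypothetical ranks assigned to five models by two benchmarks
btzsc_ranks = [1, 2, 3, 4, 5]
mteb_ranks = [2, 1, 3, 4, 5]  # one adjacent pair swapped

tau, p_value = kendalltau(btzsc_ranks, mteb_ranks)
print(f"Kendall tau = {tau:.2f}")  # -> Kendall tau = 0.80
```

With only one discordant pair out of ten, τ = (9 − 1)/10 = 0.8; the paper's τ = 0.69 over 38 models likewise indicates strong rank agreement.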
How to Apply
- For real-time classification services where latency matters, use gte-large-en-v1.5 or gte-modernbert-base as a cosine similarity-based zero-shot classifier directly — verbalize each label as "The sentiment of this review is {label}" and compute cosine similarity with text embeddings
- When accuracy is the top priority and batch processing is possible, use Qwen3-Reranker-8B — put the input text as query and verbalized labels as documents, classify by reranking score
- If you're already using Qwen3 family in your LLM pipeline, use Qwen3-4B as a zero-shot classifier with a multiple-choice prompt — attach A/B/C options to each label and select by next-token probability
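The multiple-choice recipe above can be sketched as follows. The prompt template is an assumption, and `option_logprobs` is a stand-in for real next-token log-probabilities from Qwen3-4B; wiring up the actual model is omitted.

```python
# Sketch of multiple-choice prompting for LLM zero-shot classification.
# `option_logprobs` stands in for the model's real next-token
# log-probabilities over the option letters.
labels = ["positive", "negative", "neutral"]
options = [chr(ord("A") + i) for i in range(len(labels))]  # A, B, C

text = "This product exceeded all my expectations. Highly recommend!"
prompt = (
    "Classify the sentiment of the review.\n"
    f"Review: {text}\n"
    + "".join(f"{o}. {l}\n" for o, l in zip(options, labels))
    + "Answer:"
)

# Stand-in for next-token log-probs an LLM would assign to each letter
option_logprobs = {"A": -0.2, "B": -2.1, "C": -3.0}

best_option = max(option_logprobs, key=option_logprobs.get)
predicted = labels[options.index(best_option)]
print(predicted)  # -> positive
```

In a real pipeline you would read the log-probabilities of the "A"/"B"/"C" token ids from the model's logits at the position after "Answer:" instead of using the hard-coded dict.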
Code Example
# Zero-shot sentiment classification example with gte-large-en-v1.5
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
text = "This product exceeded all my expectations. Highly recommend!"
# Label verbalization
label_descriptions = [
    "The overall sentiment within the Amazon product review is positive",
    "The overall sentiment within the Amazon product review is negative",
]
# Compute embeddings
text_emb = model.encode(text, normalize_embeddings=True)
label_embs = model.encode(label_descriptions, normalize_embeddings=True)
# Classify using cosine similarity
scores = text_emb @ label_embs.T
predicted_label = ["positive", "negative"][np.argmax(scores)]
print(f"Predicted: {predicted_label}, Scores: {scores}")
# Qwen3-Reranker-8B approach (using transformers directly)
# query = text, documents = label_descriptions
# Compute relevance scores using yes/no token logits, then classify by argmax
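The reranker route has the same shape as the embedding example: the input text is the query, verbalized labels are the documents, and the highest relevance score wins. In the sketch below, `relevance_score` is a hypothetical stand-in for a real reranker's score (e.g. Qwen3-Reranker-8B's probability of the "yes" token), since loading the 8B checkpoint is out of scope here.

```python
# Reranker-style zero-shot classification with a stubbed scorer.
# `relevance_score` is a hypothetical stand-in; swap in a real
# reranker's query-document score for actual use.
def relevance_score(query: str, document: str) -> float:
    # Toy lexical-overlap scorer, used only to make the sketch runnable
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(d_tokens), 1)

def classify(text: str, labels: list[str], template: str) -> str:
    # Verbalize each label, score it against the text, pick the best
    docs = [template.format(label=label) for label in labels]
    scores = [relevance_score(text, doc) for doc in docs]
    return labels[scores.index(max(scores))]

label = classify(
    "This product exceeded all my expectations. Highly recommend!",
    ["positive", "negative"],
    "The sentiment of this review is {label}",
)
print(label)
```

Only the scorer changes between model families; the verbalize-score-argmax loop is the same one the embedding example above uses with cosine similarity.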
Original Abstract
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.