BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
TL;DR Highlight
Which model to use for zero-shot text classification without labeled data — direct comparison of 38 models across 22 datasets.
Who Should Read
ML engineers and backend developers building text classification pipelines without labeling costs — especially those who need zero-shot sentiment analysis, intent detection, or topic classification.
Core Mechanics
- Qwen3-Reranker-8B takes #1 overall with macro F1 0.72 — F1 +12 points, accuracy +14 points above the previous best NLI cross-encoder
- Embedding models have the best speed-vs-accuracy trade-off — gte-large-en-v1.5 (F1 0.62) beats all NLI cross-encoders while being faster
- NLI cross-encoders (BART-MNLI-style architectures) hit a performance ceiling regardless of model size — best is deberta-v3-large-nli-triplet at F1 0.60
- LLMs become competitive at 4B parameters and above — Qwen3-4B at F1 0.65, Mistral-Nemo-12B at 0.67, but still 5 points below Qwen3-Reranker-8B
- Embedding model scaling has almost no effect — Qwen3-Embedding-8B (F1 0.59) vs 0.6B (F1 0.58), negligible difference
- Emotion classification is the hardest task — all model families score F1 0.25-0.5, far lower than topic/sentiment tasks
Evidence
- Qwen3-Reranker-8B: macro F1 0.72, accuracy 0.76 — #1 of 38 models, +12 F1 over NLI best deberta-v3-large-nli-triplet (F1 0.60)
- gte-large-en-v1.5: macro F1 0.62, beats all NLI cross-encoders while inference speed is much faster (Pareto-efficient)
- BTZSC vs MTEB ranking Kendall τ = 0.69 (p < 10^-8) — strong agreement between the two benchmark model rankings, confirming result reliability
- Qwen3-Reranker-0.6B (F1 0.61) alone exceeds all NLI cross-encoders by F1 — even small rerankers are stronger than the old approach
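The Kendall τ agreement check above is straightforward to reproduce with `scipy.stats.kendalltau`; the rankings below are illustrative toy data, not the actual BTZSC/MTEB rankings over 38 models.

```python
# Rank agreement between two benchmarks, illustrated with toy rankings
# (not the real BTZSC/MTEB data, which covers 38 models).
from scipy.stats import kendalltau

# Hypothetical ranks assigned to five models by two benchmarks
btzsc_ranks = [1, 2, 3, 4, 5]
mteb_ranks = [2, 1, 3, 4, 5]  # one adjacent pair swapped

tau, p_value = kendalltau(btzsc_ranks, mteb_ranks)
print(f"Kendall tau = {tau:.2f}")  # -> Kendall tau = 0.80
```

With only one discordant pair out of ten, τ = (9 − 1)/10 = 0.8; the paper's τ = 0.69 over 38 models likewise indicates strong rank agreement.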
How to Apply
- For real-time classification services where latency matters, use gte-large-en-v1.5 or gte-modernbert-base as a cosine similarity-based zero-shot classifier directly — verbalize each label as "The sentiment of this review is {label}" and compute cosine similarity with text embeddings
- When accuracy is the top priority and batch processing is possible, use Qwen3-Reranker-8B — put the input text as query and verbalized labels as documents, classify by reranking score
- If you're already using Qwen3 family in your LLM pipeline, use Qwen3-4B as a zero-shot classifier with a multiple-choice prompt — attach A/B/C options to each label and select by next-token probability
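The multiple-choice recipe above can be sketched as follows. The prompt template is an assumption, and `option_logprobs` is a stand-in for real next-token log-probabilities from Qwen3-4B; wiring up the actual model is omitted.

```python
# Sketch of multiple-choice prompting for LLM zero-shot classification.
# `option_logprobs` stands in for the model's real next-token
# log-probabilities over the option letters.
labels = ["positive", "negative", "neutral"]
options = [chr(ord("A") + i) for i in range(len(labels))]  # A, B, C

text = "This product exceeded all my expectations. Highly recommend!"
prompt = (
    "Classify the sentiment of the review.\n"
    f"Review: {text}\n"
    + "".join(f"{o}. {l}\n" for o, l in zip(options, labels))
    + "Answer:"
)

# Stand-in for next-token log-probs an LLM would assign to each letter
option_logprobs = {"A": -0.2, "B": -2.1, "C": -3.0}

best_option = max(option_logprobs, key=option_logprobs.get)
predicted = labels[options.index(best_option)]
print(predicted)  # -> positive
```

In a real pipeline you would read the log-probabilities of the "A"/"B"/"C" token ids from the model's logits at the position after "Answer:" instead of using the hard-coded dict.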
Code Example
# Zero-shot sentiment classification example with gte-large-en-v1.5
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
text = "This product exceeded all my expectations. Highly recommend!"
# Label verbalization
label_descriptions = [
    "The overall sentiment within the Amazon product review is positive",
    "The overall sentiment within the Amazon product review is negative",
]
# Compute embeddings
text_emb = model.encode(text, normalize_embeddings=True)
label_embs = model.encode(label_descriptions, normalize_embeddings=True)
# Classify using cosine similarity
scores = text_emb @ label_embs.T
predicted_label = ["positive", "negative"][np.argmax(scores)]
print(f"Predicted: {predicted_label}, Scores: {scores}")
# Qwen3-Reranker-8B approach (using transformers directly)
# query = text, documents = label_descriptions
# Compute relevance scores using yes/no token logits, then classify by argmax
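The reranker route has the same shape as the embedding example: the input text is the query, verbalized labels are the documents, and the highest relevance score wins. In the sketch below, `relevance_score` is a hypothetical stand-in for a real reranker's score (e.g. Qwen3-Reranker-8B's probability of the "yes" token), since loading the 8B checkpoint is out of scope here.

```python
# Reranker-style zero-shot classification with a stubbed scorer.
# `relevance_score` is a hypothetical stand-in; swap in a real
# reranker's query-document score for actual use.
def relevance_score(query: str, document: str) -> float:
    # Toy lexical-overlap scorer, used only to make the sketch runnable
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(d_tokens), 1)

def classify(text: str, labels: list[str], template: str) -> str:
    # Verbalize each label, score it against the text, pick the best
    docs = [template.format(label=label) for label in labels]
    scores = [relevance_score(text, doc) for doc in docs]
    return labels[scores.index(max(scores))]

label = classify(
    "This product exceeded all my expectations. Highly recommend!",
    ["positive", "negative"],
    "The sentiment of this review is {label}",
)
print(label)
```

Only the scorer changes between model families; the verbalize-score-argmax loop is the same one the embedding example above uses with cosine similarity.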
Original Abstract
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.