LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
TL;DR Highlight
A unified benchmark of 10 LLM routing techniques across 400K queries reveals that even the latest commercial routers underperform the single best model.
Who Should Read
ML engineers and backend developers building routing systems that automatically assign each query to the most suitable LLM, reducing reliance on expensive models like GPT-5. Teams operating production systems that combine multiple LLM APIs.
Core Mechanics
- Models have domain-specific strengths — e.g., Qwen3-8B and Intern-S1-mini excel at math, while Qwen-Coder and Fin-R1 lead on code — confirming that the core premise of routing (model complementarity) holds in practice
- Under controlled evaluation conditions, top routers such as EmbedLLM, GraphRouter, MODEL-SAT, and Avengers show nearly identical performance; even the clustering-based Avengers, which requires no neural network training, matches the others
- The commercial router OpenRouter scores 24.7% lower in accuracy than the Best Single model (GPT-5) — complex routers do not always beat simple strategies
- The large performance gap between current routers and the Oracle (ideal selection) is primarily caused by model-recall failure: the inability to identify rare queries that only one or two models can answer correctly
- Swapping the embedding model to the 22M-parameter all-MiniLM produces negligible differences in routing performance — using a better embedding model is not the bottleneck
- Blindly expanding the model pool yields diminishing returns; a carefully curated small set of models outperforms a large random pool
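The Oracle gap and model-recall failure described above can be made concrete with a toy correctness matrix. Everything below is hypothetical (random data standing in for real per-query results); the point is only how Oracle accuracy, Best Single accuracy, and the "hard query" subset are computed.

```python
import numpy as np

# Hypothetical binary correctness matrix: correct[i, j] = 1 if model j
# answers query i correctly (a real benchmark supplies this per dataset)
rng = np.random.default_rng(0)
correct = (rng.random((1000, 10)) < 0.6).astype(int)

oracle_acc = correct.max(axis=1).mean()       # ideal per-query model selection
best_single_acc = correct.mean(axis=0).max()  # best fixed model

# "Hard" queries: answered by three or fewer models -- the model-recall regime
solvers = correct.sum(axis=1)
hard = (solvers > 0) & (solvers <= 3)
print(f"oracle={oracle_acc:.3f}  best_single={best_single_acc:.3f}  "
      f"hard_share={hard.mean():.3f}")
```

The Oracle is an upper bound by construction: it counts a query as solved if any model in the pool solves it, so the gap to it measures exactly the routing decisions that were missed.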
Evidence
- Commercial router OpenRouter achieves PerfGain of -24.7% vs. Best Single (GPT-5) — worse than a single model despite using a larger model pool
- Avengers-Pro (clustering-based) achieves up to +4% accuracy improvement and 31.7% cost reduction simultaneously vs. Best Single, with ParetoDist ≈ 0.001, placing it near the Pareto frontier
- On hard queries that three or fewer models answer correctly (11.9% of queries, 410 in total), Avengers scores 24.6% and EmbedLLM 23.2% accuracy; the failure to recall rare specialist models is the key bottleneck
- Replacing the embedding model gte-qwen2-7B (7B params) with all-MiniLM-L6-v2 (22.7M params) results in negligible changes across GraphRouter, EmbedLLM, and Avengers (accuracy 70.29→68.05, 71.24→70.95, and 71.94→71.03, respectively)
How to Apply
- For systems using multiple LLM APIs, start by trying an Avengers-Pro-style approach: cluster queries with k-means (k=64) and assign the best-performing model per cluster — this can achieve top-tier performance without any neural network training
- If cost reduction is the goal, avoid indiscriminately adding models to the pool; instead, curate a small pool of top-k models per domain for better efficiency — curation takes priority over pool expansion
- When evaluating your own router, use Best Single (the top-performing single model) as the baseline — if your router can't beat it, the routing provides no value; results are reproducible using the publicly available code based on LLMRouterBench's 21 datasets
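The "curation over pool expansion" advice can be sketched as a greedy coverage selection: pick the few models that together answer the most queries. This is one plausible way to curate a small pool, not the benchmark's actual procedure, and the correctness matrix below is random placeholder data.

```python
import numpy as np

def curate_pool(correct: np.ndarray, k: int) -> list:
    """Greedy coverage-based curation: pick k models that together answer
    the most queries correctly. Illustrative, not the paper's exact method."""
    n_queries, n_models = correct.shape
    chosen, covered = [], np.zeros(n_queries, dtype=bool)
    for _ in range(k):
        # Marginal gain of each unchosen model: newly covered queries
        gains = [-1 if j in chosen else int((correct[:, j] & ~covered).sum())
                 for j in range(n_models)]
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= correct[:, best]
    return chosen

rng = np.random.default_rng(1)
correct = rng.random((500, 20)) < 0.5  # hypothetical boolean correctness matrix
pool = curate_pool(correct, 3)
curated_cov = correct[:, pool].any(axis=1).mean()
random_cov = correct[:, rng.choice(20, 3, replace=False)].any(axis=1).mean()
print(f"curated={curated_cov:.3f}  random={random_cov:.3f}")
```

Because the first greedy pick is always the single best-covering model, a curated pool of size k can never cover fewer queries than Best Single, which matches the diminishing-returns finding above.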
Code Example
# Avengers-Pro style clustering router (simplified implementation)
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
import numpy as np
class ClusteringRouter:
    def __init__(self, model_pool, model_costs=None, k=64, embed_model='all-MiniLM-L6-v2'):
        self.model_pool = model_pool  # {'gpt-5': client_gpt5, 'qwen3-235b': client_qwen, ...}
        self.model_costs = model_costs or {m: 1.0 for m in model_pool}  # relative cost per call
        self.k = k
        self.embedder = SentenceTransformer(embed_model)  # embedding model quality has little impact
        self.kmeans = None
        self.cluster_scores = {}  # per-cluster accuracy of each model

    def fit(self, train_queries, train_labels):
        """train_labels[i]: {'gpt-5': 0/1, 'qwen3-235b': 0/1, ...} for query i"""
        embeddings = self.embedder.encode(train_queries)
        self.kmeans = KMeans(n_clusters=self.k, random_state=42)
        cluster_ids = self.kmeans.fit_predict(embeddings)
        # Estimate each model's accuracy within each cluster
        for c in range(self.k):
            idxs = np.where(cluster_ids == c)[0]
            self.cluster_scores[c] = {
                m: np.mean([train_labels[i].get(m, 0) for i in idxs]) if len(idxs) else 0.0
                for m in self.model_pool}

    def route(self, query, alpha=0.7):
        """alpha: performance weight (1.0 = performance first, 0.0 = cost first)"""
        emb = self.embedder.encode([query])
        cluster = int(self.kmeans.predict(emb)[0])
        max_cost = max(self.model_costs.values())
        # Simplified performance-cost trade-off: cluster accuracy minus normalized cost
        scores = {m: alpha * acc - (1 - alpha) * self.model_costs[m] / max_cost
                  for m, acc in self.cluster_scores[cluster].items()}
        return max(scores, key=scores.get)
# Usage example
# router = ClusteringRouter(model_pool={'gpt-5': ..., 'qwen3-235b': ..., 'gemini-flash': ...})
# router.fit(train_queries, train_labels)
# best_model = router.route('Prove that sqrt(2) is irrational')
Original Abstract
Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented routing and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity-the central premise of LLM routing-we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.