LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
TL;DR Highlight
A unified benchmark of 10 LLM routing techniques across 400K queries reveals that even the latest commercial routers underperform the single best model.
Who Should Read
ML engineers and backend developers building routing systems that automatically assign each query to the most suitable LLM, reducing reliance on expensive models like GPT-5. Teams operating production systems that combine multiple LLM APIs.
Core Mechanics
- Models have domain-specific strengths — e.g., Qwen3-8B and Intern-S1-mini excel at math, while Qwen-Coder and Fin-R1 lead on code — confirming that the core premise of routing (model complementarity) holds in practice
- Under controlled evaluation conditions, top routers such as EmbedLLM, GraphRouter, MODEL-SAT, and Avengers show nearly identical performance; even the clustering-based Avengers, which requires no neural network training, matches the others
- The commercial router OpenRouter scores 24.7% lower in accuracy than the Best Single model (GPT-5) — complex routers do not always beat simple strategies
- The large performance gap between current routers and the Oracle (ideal selection) is primarily caused by model-recall failure: the inability to identify rare queries that only one or two models can answer correctly
- Swapping the embedding model to the 22M-parameter all-MiniLM produces negligible differences in routing performance — using a better embedding model is not the bottleneck
- Blindly expanding the model pool yields diminishing returns; a carefully curated small set of models outperforms a large random pool
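The Oracle gap and model-recall failure described above can be made concrete with a toy correctness matrix. Everything below is hypothetical (random data standing in for real per-query results); the point is only how Oracle accuracy, Best Single accuracy, and the "hard query" subset are computed.

```python
import numpy as np

# Hypothetical binary correctness matrix: correct[i, j] = 1 if model j
# answers query i correctly (a real benchmark supplies this per dataset)
rng = np.random.default_rng(0)
correct = (rng.random((1000, 10)) < 0.6).astype(int)

oracle_acc = correct.max(axis=1).mean()       # ideal per-query model selection
best_single_acc = correct.mean(axis=0).max()  # best fixed model

# "Hard" queries: answered by three or fewer models -- the model-recall regime
solvers = correct.sum(axis=1)
hard = (solvers > 0) & (solvers <= 3)
print(f"oracle={oracle_acc:.3f}  best_single={best_single_acc:.3f}  "
      f"hard_share={hard.mean():.3f}")
```

The Oracle is an upper bound by construction: it counts a query as solved if any model in the pool solves it, so the gap to it measures exactly the routing decisions that were missed.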
Evidence
- Commercial router OpenRouter achieves PerfGain of -24.7% vs. Best Single (GPT-5) — worse than a single model despite using a larger model pool
- Avengers-Pro (clustering-based) achieves up to +4% accuracy improvement and 31.7% cost reduction simultaneously vs. Best Single, with ParetoDist ≈ 0.001, placing it near the Pareto frontier
- On hard queries that three or fewer models answer correctly (11.9% of queries, 410 in total), Avengers scores 24.6% and EmbedLLM 23.2% accuracy; the failure to recall rare specialist models is the key bottleneck
- Replacing the embedding model gte-qwen2-7B (7B params) with all-MiniLM-L6-v2 (22.7M params) results in negligible changes across GraphRouter, EmbedLLM, and Avengers (accuracy 70.29→68.05, 71.24→70.95, and 71.94→71.03, respectively)
How to Apply
- For systems using multiple LLM APIs, start by trying an Avengers-Pro-style approach: cluster queries with k-means (k=64) and assign the best-performing model per cluster — this can achieve top-tier performance without any neural network training
- If cost reduction is the goal, avoid indiscriminately adding models to the pool; instead, curate a small pool of top-k models per domain for better efficiency — curation takes priority over pool expansion
- When evaluating your own router, use Best Single (the top-performing single model) as the baseline — if your router can't beat it, the routing provides no value; results are reproducible using the publicly available code based on LLMRouterBench's 21 datasets
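The "curation over pool expansion" advice can be sketched as a greedy coverage selection: pick the few models that together answer the most queries. This is one plausible way to curate a small pool, not the benchmark's actual procedure, and the correctness matrix below is random placeholder data.

```python
import numpy as np

def curate_pool(correct: np.ndarray, k: int) -> list:
    """Greedy coverage-based curation: pick k models that together answer
    the most queries correctly. Illustrative, not the paper's exact method."""
    n_queries, n_models = correct.shape
    chosen, covered = [], np.zeros(n_queries, dtype=bool)
    for _ in range(k):
        # Marginal gain of each unchosen model: newly covered queries
        gains = [-1 if j in chosen else int((correct[:, j] & ~covered).sum())
                 for j in range(n_models)]
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= correct[:, best]
    return chosen

rng = np.random.default_rng(1)
correct = rng.random((500, 20)) < 0.5  # hypothetical boolean correctness matrix
pool = curate_pool(correct, 3)
curated_cov = correct[:, pool].any(axis=1).mean()
random_cov = correct[:, rng.choice(20, 3, replace=False)].any(axis=1).mean()
print(f"curated={curated_cov:.3f}  random={random_cov:.3f}")
```

Because the first greedy pick is always the single best-covering model, a curated pool of size k can never cover fewer queries than Best Single, which matches the diminishing-returns finding above.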
Code Example
# Avengers-Pro style clustering router (simplified implementation)
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
import numpy as np
class ClusteringRouter:
    def __init__(self, model_pool, model_costs=None, k=64, embed_model='all-MiniLM-L6-v2'):
        self.model_pool = model_pool  # {'gpt-5': client_gpt5, 'qwen3-235b': client_qwen, ...}
        self.model_costs = model_costs or {m: 1.0 for m in model_pool}  # relative cost per call
        self.k = k
        self.embedder = SentenceTransformer(embed_model)  # embedding model quality has little impact
        self.kmeans = None
        self.cluster_scores = {}  # per-cluster accuracy of each model

    def fit(self, train_queries, train_labels):
        """train_labels[i]: {'gpt-5': 0/1, 'qwen3-235b': 0/1, ...} for query i"""
        embeddings = self.embedder.encode(train_queries)
        self.kmeans = KMeans(n_clusters=self.k, random_state=42)
        cluster_ids = self.kmeans.fit_predict(embeddings)
        # Estimate each model's accuracy within each cluster
        for c in range(self.k):
            idxs = np.where(cluster_ids == c)[0]
            self.cluster_scores[c] = {
                m: np.mean([train_labels[i].get(m, 0) for i in idxs]) if len(idxs) else 0.0
                for m in self.model_pool}

    def route(self, query, alpha=0.7):
        """alpha: performance weight (1.0 = performance first, 0.0 = cost first)"""
        emb = self.embedder.encode([query])
        cluster = int(self.kmeans.predict(emb)[0])
        max_cost = max(self.model_costs.values())
        # Simplified performance-cost trade-off: cluster accuracy minus normalized cost
        scores = {m: alpha * acc - (1 - alpha) * self.model_costs[m] / max_cost
                  for m, acc in self.cluster_scores[cluster].items()}
        return max(scores, key=scores.get)
# Usage example
# router = ClusteringRouter(model_pool={'gpt-5': ..., 'qwen3-235b': ..., 'gemini-flash': ...})
# router.fit(train_queries, train_labels)
# best_model = router.route('Prove that sqrt(2) is irrational')
Original Abstract
Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented routing and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity-the central premise of LLM routing-we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.