MixLLM: Dynamic Routing in Mixed Large Language Models
TL;DR Highlight
A routing system that automatically selects the optimal model per query across multiple LLMs — achieving 97% of GPT-4 quality at 24% of the cost.
Who Should Read
Backend/ML engineers mixing multiple LLM APIs who want both quality and cost control, especially teams that want to reduce GPT-4 usage without sacrificing quality.
Core Mechanics
- Builds a router considering quality, cost, and latency simultaneously to select the optimal LLM from candidates (GPT-4, GPT-3.5, Llama, etc.) per query
- Uses InsTag model to tag queries with domain labels and unsupervised BERT encoder fine-tuning to create routing-specialized embeddings
- Separate lightweight prediction models (Random Forest, MLP, KNN) per LLM predict response quality and cost — no full retraining needed when adding new models
- Applies latency penalty when queries flood a specific LLM to automatically prevent bottlenecks
- Supports continual learning from user feedback (thumbs up/down) after deployment, using a contextual bandit — a lightweight, context-aware form of reinforcement learning
- After adding Llama 3.1 8B/70B: 98.55% of GPT-4 quality at 16.79% cost — flexibly handles LLM pool expansion
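The contextual-bandit feedback loop above can be sketched as a LinUCB-style update: each LLM is an "arm" that maintains a ridge-regression estimate of expected reward (user feedback) as a linear function of the query embedding, plus an uncertainty bonus for exploration. This is a minimal illustration, not the paper's exact implementation; the `LinUCBArm` class, the embedding dimension `d`, and the reward encoding are assumptions for the sketch.

```python
import numpy as np

class LinUCBArm:
    """One arm per LLM: ridge-regression reward estimate over query embeddings.
    Illustrative sketch, not MixLLM's exact implementation."""
    def __init__(self, d, alpha=0.5):
        self.alpha = alpha   # exploration strength
        self.A = np.eye(d)   # d x d feature covariance accumulator
        self.b = np.zeros(d) # reward-weighted feature sum

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # current reward estimate
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty bonus
        return theta @ x + bonus

    def update(self, x, reward):
        # reward encoding assumed: 1.0 for thumbs-up, 0.0 for thumbs-down
        self.A += np.outer(x, x)
        self.b += reward * x

# Route by picking the arm with the highest UCB, then learn from feedback
d = 8  # toy embedding dimension
arms = {name: LinUCBArm(d) for name in ('gpt-4', 'gpt-3.5-turbo')}
x = np.random.default_rng(0).normal(size=d)  # stand-in query embedding
chosen = max(arms, key=lambda n: arms[n].ucb(x))
arms[chosen].update(x, reward=1.0)  # user gave a thumbs-up
```

Because each arm learns independently, adding a new LLM only means adding a fresh arm — consistent with the paper's claim that no full retraining is needed when the candidate pool changes.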
Evidence
- 97.25% of GPT-4 quality at 24.18% of GPT-4 cost (best baseline OptLLM: 96.39% quality at 32.94% cost)
- After adding Llama 3.1: 98.55% of GPT-4 quality at 16.79% cost
- Tag-enhanced embeddings: 5.72% response quality improvement in low-cost range (53.14% to 56.18%)
- With 70% online learning data: refined feedback +2.22%, binary feedback +1.31% performance improvement
How to Apply
- If running GPT-4, GPT-3.5-turbo, and Llama 3.1 together: train lightweight quality predictors for each model, then add a routing layer using query embedding + predicted quality/cost scores.
- If your service has user satisfaction feedback (thumbs up/down): attach Contextual Bandit online learning to improve routing accuracy over time.
- If specific LLM APIs slow down at certain times: include latency-based exponential penalty in routing scores to automatically distribute to less congested models.
Code Example
# MixLLM routing logic — core pseudocode
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class MixLLMRouter:
    def __init__(self, llm_candidates, lambda_=1.4, alpha=0.01, beta=0.1):
        """
        llm_candidates: [{'name': 'gpt-4', 'price_in': 0.03, 'price_out': 0.06, 'speed': 50}, ...]
        lambda_: quality-vs-cost priority (higher = quality first)
        alpha: weight of the uncertainty (exploration) bonus
        beta: weight of the latency penalty
        """
        self.llms = llm_candidates
        self.lambda_ = lambda_
        self.alpha = alpha
        self.beta = beta
        # Independent quality predictor per LLM; each must be fit on
        # (query_embedding, observed_quality) pairs before routing.
        self.quality_predictors = {llm['name']: RandomForestRegressor() for llm in llm_candidates}
        self.waiting_times = {llm['name']: 0.0 for llm in llm_candidates}

    def score(self, llm_name, predicted_quality, predicted_cost):
        # 1. Quality-cost trade-off score
        lam = self.lambda_
        s_trade = (lam / (lam + 1)) * predicted_quality - (1 / (lam + 1)) * predicted_cost
        # 2. Uncertainty bonus (exploration)
        # The actual implementation uses a LinUCB-style inverse covariance matrix
        s_unc = 0.01  # simplified to a constant here
        # 3. Latency penalty (suppresses selection when waiting time is long)
        gamma, xi, tau = 0.1, 0.8, 30.0
        w = self.waiting_times[llm_name]
        s_pen = np.exp(gamma * (w - xi * tau))
        return s_trade + self.alpha * s_unc - self.beta * s_pen

    def route(self, query_embedding):
        scores = {}
        for llm in self.llms:
            name = llm['name']
            # Predict quality/cost per LLM
            pred_quality = self.quality_predictors[name].predict([query_embedding])[0]
            pred_cost = 0.001 * len(query_embedding)  # simplified cost estimate
            scores[name] = self.score(name, pred_quality, pred_cost)
        # Select the LLM with the highest score
        return max(scores, key=scores.get)

# Usage example
llms = [
    {'name': 'gpt-4', 'price_in': 0.03, 'price_out': 0.06},
    {'name': 'gpt-3.5-turbo', 'price_in': 0.001, 'price_out': 0.002},
    {'name': 'llama-3.1-70b', 'price_in': 0.0009, 'price_out': 0.0009},
]
router = MixLLMRouter(llms, lambda_=1.4)
# selected_llm = router.route(query_embedding)
Original Abstract
Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments to best tradeoff response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).