MixLLM: Dynamic Routing in Mixed Large Language Models
TL;DR Highlight
A routing system that automatically selects the optimal model per query across multiple LLMs — achieving 97% of GPT-4 quality at 24% of the cost.
Who Should Read
Backend/ML engineers mixing multiple LLM APIs who want both quality and cost control, especially teams that want to reduce GPT-4 usage without sacrificing quality.
Core Mechanics
- Builds a router considering quality, cost, and latency simultaneously to select the optimal LLM from candidates (GPT-4, GPT-3.5, Llama, etc.) per query
- Uses InsTag model to tag queries with domain labels and unsupervised BERT encoder fine-tuning to create routing-specialized embeddings
- Separate lightweight prediction models (Random Forest, MLP, KNN) per LLM predict response quality and cost — no full retraining needed when adding new models
- Applies latency penalty when queries flood a specific LLM to automatically prevent bottlenecks
- Supports continual learning from user feedback (thumbs up/down) after deployment, using a contextual bandit — a lightweight, context-aware form of reinforcement learning
- After adding Llama 3.1 8B/70B: 98.55% of GPT-4 quality at 16.79% cost — flexibly handles LLM pool expansion
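The contextual-bandit feedback loop above can be sketched as a LinUCB-style update: each LLM is an "arm" that maintains a ridge-regression estimate of expected reward (user feedback) as a linear function of the query embedding, plus an uncertainty bonus for exploration. This is a minimal illustration, not the paper's exact implementation; the `LinUCBArm` class, the embedding dimension `d`, and the reward encoding are assumptions for the sketch.

```python
import numpy as np

class LinUCBArm:
    """One arm per LLM: ridge-regression reward estimate over query embeddings.
    Illustrative sketch, not MixLLM's exact implementation."""
    def __init__(self, d, alpha=0.5):
        self.alpha = alpha   # exploration strength
        self.A = np.eye(d)   # d x d feature covariance accumulator
        self.b = np.zeros(d) # reward-weighted feature sum

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # current reward estimate
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty bonus
        return theta @ x + bonus

    def update(self, x, reward):
        # reward encoding assumed: 1.0 for thumbs-up, 0.0 for thumbs-down
        self.A += np.outer(x, x)
        self.b += reward * x

# Route by picking the arm with the highest UCB, then learn from feedback
d = 8  # toy embedding dimension
arms = {name: LinUCBArm(d) for name in ('gpt-4', 'gpt-3.5-turbo')}
x = np.random.default_rng(0).normal(size=d)  # stand-in query embedding
chosen = max(arms, key=lambda n: arms[n].ucb(x))
arms[chosen].update(x, reward=1.0)  # user gave a thumbs-up
```

Because each arm learns independently, adding a new LLM only means adding a fresh arm — consistent with the paper's claim that no full retraining is needed when the candidate pool changes.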
Evidence
- 97.25% of GPT-4 quality at 24.18% of GPT-4 cost (best baseline OptLLM: 96.39% quality at 32.94% cost)
- After adding Llama 3.1: 98.55% of GPT-4 quality at 16.79% cost
- Tag-enhanced embeddings: 5.72% response quality improvement in low-cost range (53.14% to 56.18%)
- With 70% online learning data: refined feedback +2.22%, binary feedback +1.31% performance improvement
How to Apply
- If running GPT-4, GPT-3.5-turbo, and Llama 3.1 together: train lightweight quality predictors for each model, then add a routing layer using query embedding + predicted quality/cost scores.
- If your service has user satisfaction feedback (thumbs up/down): attach Contextual Bandit online learning to improve routing accuracy over time.
- If specific LLM APIs slow down at certain times: include latency-based exponential penalty in routing scores to automatically distribute to less congested models.
Code Example
# MixLLM routing logic — core pseudocode
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class MixLLMRouter:
    def __init__(self, llm_candidates, lambda_=1.4, alpha=0.01, beta=0.1):
        """
        llm_candidates: [{'name': 'gpt-4', 'price_in': 0.03, 'price_out': 0.06, 'speed': 50}, ...]
        lambda_: quality-vs-cost priority (higher = quality first)
        alpha: weight of the uncertainty (exploration) bonus
        beta: weight of the latency penalty
        """
        self.llms = llm_candidates
        self.lambda_ = lambda_
        self.alpha = alpha
        self.beta = beta
        # Independent quality predictor per LLM; each must be fit on
        # (query_embedding, observed_quality) pairs before routing.
        self.quality_predictors = {llm['name']: RandomForestRegressor() for llm in llm_candidates}
        self.waiting_times = {llm['name']: 0.0 for llm in llm_candidates}

    def score(self, llm_name, predicted_quality, predicted_cost):
        # 1. Quality-cost trade-off score
        lam = self.lambda_
        s_trade = (lam / (lam + 1)) * predicted_quality - (1 / (lam + 1)) * predicted_cost
        # 2. Uncertainty bonus (exploration)
        # The actual implementation uses a LinUCB-style inverse covariance matrix
        s_unc = 0.01  # simplified to a constant here
        # 3. Latency penalty (suppresses selection when waiting time is long)
        gamma, xi, tau = 0.1, 0.8, 30.0
        w = self.waiting_times[llm_name]
        s_pen = np.exp(gamma * (w - xi * tau))
        return s_trade + self.alpha * s_unc - self.beta * s_pen

    def route(self, query_embedding):
        scores = {}
        for llm in self.llms:
            name = llm['name']
            # Predict quality/cost per LLM
            pred_quality = self.quality_predictors[name].predict([query_embedding])[0]
            pred_cost = 0.001 * len(query_embedding)  # simplified cost estimate
            scores[name] = self.score(name, pred_quality, pred_cost)
        # Select the LLM with the highest score
        return max(scores, key=scores.get)

# Usage example
llms = [
    {'name': 'gpt-4', 'price_in': 0.03, 'price_out': 0.06},
    {'name': 'gpt-3.5-turbo', 'price_in': 0.001, 'price_out': 0.002},
    {'name': 'llama-3.1-70b', 'price_in': 0.0009, 'price_out': 0.0009},
]
router = MixLLMRouter(llms, lambda_=1.4)
# selected_llm = router.route(query_embedding)
Original Abstract
Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments to best tradeoff response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).