Querywise Prompt Routing for Large Language Models
TL;DR Highlight
The best prompt varies from query to query, even within the same task. This routing technique automatically selects the best-matching prompt for each query.
Who Should Read
Developers and researchers running LLM pipelines at scale who want to squeeze out accuracy gains without fine-tuning. Useful when you have a diverse query workload.
Core Mechanics
- Query-level prompt routing: dynamically assigns the best prompt template to each query rather than using a single fixed prompt for all queries
- Routing decisions are made by a lightweight model (e.g., k-NN or logistic regression over embeddings) trained offline on logged query-prompt pairs, so no extra LLM calls or gold answers are needed at inference time
- Reports consistent accuracy gains over query-agnostic single-prompt baselines and confidence-based selectors on standard reasoning benchmarks
- Compatible with any LLM backend — the routing layer is independent of the underlying model
- Low routing overhead; the classifier inference adds negligible latency compared to LLM calls
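The separation between the routing layer and the model backend can be sketched as a thin wrapper. This is a minimal illustration, not the paper's implementation: the template library, the `make_routed_llm` helper, and the stand-in classifier and backend are all hypothetical names introduced here.

```python
from typing import Callable, Dict

# Hypothetical template library; labels and wording are illustrative only.
TEMPLATES: Dict[str, str] = {
    "reasoning": "Solve step by step, then state the final answer.\n\nQuestion: {query}",
    "factual": "Answer concisely and directly.\n\nQuestion: {query}",
}

def make_routed_llm(
    classify: Callable[[str], str],
    llm: Callable[[str], str],
    default: str = "factual",
) -> Callable[[str], str]:
    """Wrap any text-in/text-out backend with per-query template selection."""
    def routed(query: str) -> str:
        label = classify(query)
        # Fall back to the default template if the classifier returns an unknown label.
        template = TEMPLATES.get(label, TEMPLATES[default])
        return llm(template.format(query=query))
    return routed

# Stand-ins so the sketch runs without a model or a trained classifier.
def toy_classify(query: str) -> str:
    return "reasoning" if any(ch.isdigit() for ch in query) else "factual"

def echo_llm(prompt: str) -> str:
    return prompt  # a real backend would return a completion here

ask = make_routed_llm(toy_classify, echo_llm)
print(ask("What is 3 + 5 * 2?"))
```

Because the wrapper only needs a function from prompt text to response text, swapping the backend does not require touching the router.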
Evidence
- Outperforms best single-prompt baselines across evaluated benchmarks
- Routing classifier trained on hundreds of labeled query–prompt pairs achieves high routing accuracy
- Gains are consistent across GPT-3.5, GPT-4, and open-source models
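One way to check a claim like "outperforms the best single prompt" on your own logs is to compare three numbers: the accuracy of the best fixed prompt, the accuracy of the router's choices, and the oracle upper bound (always picking a prompt that worked). The sketch below uses a tiny made-up correctness table; the data, prompt names, and router decisions are hypothetical and are not results from the paper.

```python
# Hypothetical log: correct[q][p] is 1 if prompt p answered query q correctly.
correct = {
    "q1": {"step_by_step": 1, "direct": 0},
    "q2": {"step_by_step": 0, "direct": 1},
    "q3": {"step_by_step": 1, "direct": 0},
    "q4": {"step_by_step": 1, "direct": 0},
    "q5": {"step_by_step": 0, "direct": 1},
    "q6": {"step_by_step": 0, "direct": 0},
}
# Hypothetical router decisions for the same queries.
routed_choice = {
    "q1": "step_by_step", "q2": "direct", "q3": "step_by_step",
    "q4": "direct", "q5": "direct", "q6": "step_by_step",
}

def single_prompt_accuracy(correct, prompt):
    """Accuracy when the same prompt is used for every query."""
    return sum(row[prompt] for row in correct.values()) / len(correct)

def best_single_prompt(correct):
    """The fixed prompt with the highest accuracy over all queries."""
    prompts = list(next(iter(correct.values())).keys())
    return max(prompts, key=lambda p: single_prompt_accuracy(correct, p))

def routed_accuracy(correct, choice):
    """Accuracy when each query uses the prompt the router picked for it."""
    return sum(correct[q][choice[q]] for q in correct) / len(correct)

def oracle_accuracy(correct):
    """Upper bound: a query counts as solved if any prompt solved it."""
    return sum(max(row.values()) for row in correct.values()) / len(correct)

baseline = best_single_prompt(correct)
print("best single prompt:", baseline, single_prompt_accuracy(correct, baseline))
print("routed:", routed_accuracy(correct, routed_choice))
print("oracle:", oracle_accuracy(correct))
```

The gap between the best single prompt and the oracle shows how much headroom routing has on a given workload; if that gap is small, a router is unlikely to pay off.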
How to Apply
- Build a library of prompt templates each optimized for a specific query type (factual, reasoning, creative, etc.).
- Train a lightweight classifier (k-NN or logistic regression) on query embeddings to predict which prompt template to use.
- Deploy the router in front of your LLM call; keep the routing model small to avoid latency overhead.
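The classifier step above can be sketched with a k-NN router written in plain Python. To keep it dependency-free, a toy bag-of-words vector stands in for a real sentence embedding, and neighbors vote with their similarity as weight; the training examples, labels, and the `embed`, `cosine`, and `knn_route` names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a sentence encoder."""
    return Counter(text.lower().replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical labeled queries: each label names a prompt template type.
TRAIN = [
    ("what is 12 times 7", "reasoning"),
    ("how many apples are left if i eat 3 of 10", "reasoning"),
    ("what is the sum of 4 and 9", "reasoning"),
    ("who wrote hamlet", "factual"),
    ("what is the capital of france", "factual"),
    ("when did the berlin wall fall", "factual"),
]

def knn_route(query: str, k: int = 3) -> str:
    """Return the template label favored by the k most similar training queries."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), label) for text, label in TRAIN]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim  # similarity-weighted vote
    return max(votes, key=votes.get)

print(knn_route("What is the sum of 12 and 7?"))
print(knn_route("Who wrote Macbeth?"))
```

Word overlap is a weak signal (many factual and arithmetic questions both start with "what is"), so in practice the `embed` function is the piece to replace first.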
Code Example
# Illustrative sketch of per-query prompt routing (not the authors' code)
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np

# 1. Prepare (query, prompt, score) records from historical logs
# score: preference label based on response quality (e.g., 0 or 1)
logs = [
    {"query": "What is 3 + 5 * 2?", "prompt": "Calculate step by step.", "score": 1},
    {"query": "What is 3 + 5 * 2?", "prompt": "Answer directly.", "score": 0},
    # ... more logs
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pair_text(query: str, prompt: str) -> str:
    # Join query and prompt into one string so the encoder sees both
    return query + " [SEP] " + prompt

# 2. Embed (query + prompt) pairs and train a preference model on them
X = encoder.encode([pair_text(log["query"], log["prompt"]) for log in logs])
y = [log["score"] for log in logs]  # must contain both classes (0 and 1)
reward_model = LogisticRegression()
reward_model.fit(np.asarray(X), y)

# 3. Select the best-of-N prompt for a new query
def route_prompt(query: str, candidate_prompts: list[str]) -> str:
    embs = encoder.encode([pair_text(query, p) for p in candidate_prompts])
    # Column 1 of predict_proba is the probability of the positive class (score == 1)
    scores = reward_model.predict_proba(np.asarray(embs))[:, 1]
    best_idx = int(np.argmax(scores))
    return candidate_prompts[best_idx]

# Usage example
candidates = [
    "Solve step by step and provide the final answer.",
    "Calculate the expression as-is and reply with only the number.",
    "First check the order of operations, then calculate.",
]
query = "What is the result of (12 / 4) + 3 * 7?"
best_prompt = route_prompt(query, candidates)
print(f"Selected prompt: {best_prompt}")
Original Abstract
This paper treats prompt choice as a per-query decision problem for large language models, learning an offline proxy reward that can score query-prompt pairs without additional model calls or access to gold answers at inference time. Using prior prompt-response logs as demonstrations, the method trains a preference model over prompts and then selects a best-of-N instruction per query to boost arithmetic reasoning accuracy under strict zero-shot conditions. The pipeline reduces interaction cost by shifting evaluation and optimization offline, while preserving the natural-language prompt space so the approach remains model-agnostic and immediately deployable across chat-oriented LLMs. Experiments on standard reasoning benchmarks show consistent gains over distribution-level, query-agnostic prompting and over confidence-based selectors, with improvements holding across multiple LLM scales. Ablations confirm that the learned reward generalizes to unseen prompts and queries, enabling robust prompt routing at inference without additional gradient updates or tool-specific supervision.