RelayLLM: Efficient Reasoning via Collaborative Decoding
TL;DR Highlight
A collaborative inference framework in which a small model calls a large model only for difficult tokens, maintaining accuracy while cutting token costs by 98.2% relative to performance-matched routers.
Who Should Read
ML engineers who find LLM inference costs burdensome and are weighing smaller models or routing strategies as replacements, especially teams that need to optimize the cost-performance tradeoff on math/reasoning tasks.
Core Mechanics
- Traditional routers use an 'all-or-nothing' approach of passing the entire query to a large model → RelayLLM generates a <call> command at the token level and invokes the large model only when needed
- Qwen3-1.7B (small) uses Qwen3-8B (large) as a teacher, with only 1.07% of total tokens generated by the large model → the remaining 98.93% are handled directly by the small model
- Two-stage training with GRPO (a reinforcement learning technique): ① cold start (learning command syntax) → ② difficulty-aware rewards to optimize 'when to ask for help'
- Three reward designs by difficulty level: independence bonus if solvable alone (r=1.5), penalty if unsolvable without the teacher (r=-1.0), exploration reward if neither can solve it (r=ρ)
- Trained only on math data but successfully generalized to other domains such as MMLU-Pro and Big-Bench Hard (MMLU-Pro: GRPO 49.76% → RelayLLM 59.03%)
- In Teacher-Free evaluation (calls completely blocked), performance on easy benchmarks exceeded the GRPO baseline → collaborative training also improves the small model's intrinsic reasoning ability
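The difficulty-aware rewards above can be sketched as a single scoring function. This is one plausible reading of the bullet: the values 1.5 and -1.0 come from the summary, while the medium-difficulty reward of 1.0, the treatment of teacher calls, and the default ρ = 0.5 are illustrative assumptions.

```python
def relay_reward(correct: bool, used_teacher: bool, difficulty: str,
                 rho: float = 0.5) -> float:
    """Difficulty-aware reward for one rollout (sketch, not the paper's exact API)."""
    if difficulty == "easy":
        # SLM can solve this alone: independence bonus for solving without help,
        # a smaller reward if it called the teacher anyway, penalty on failure
        if correct:
            return 1.5 if not used_teacher else 1.0
        return -1.0
    if difficulty == "medium":
        # Solvable only with the teacher: failing (e.g., by never calling) is penalized
        return 1.0 if correct else -1.0
    # "hard": neither model solves it reliably, so reward exploration with rho
    return rho
```

In GRPO training, such a reward would be computed per sampled rollout, with the difficulty bucket estimated from pass rates of the small and large models on each problem.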
Evidence
- Average accuracy across 6 math benchmarks: small model alone 42.5% → RelayLLM 49.52%, recovering 60% of the gap to the large model (Qwen3-8B) at 54.12%
- 6.9% accuracy improvement over a Random Router at equivalent cost; 98.2% token cost reduction compared to a router of equivalent performance
- On the Minerva benchmark with Qwen3-0.6B: 15.81% → 23.53% (48.8% relative improvement) with a large model call ratio of 0.77%
- On MMLU-Pro with Qwen3-1.7B: GRPO 49.76% vs CITER 53.38% vs RelayLLM 59.03%, confirming out-of-domain generalization advantage
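The 98.2% figure can be sanity-checked with back-of-envelope arithmetic. Assuming cost is proportional to the number of LLM-generated tokens (an assumption, not stated in the summary), we can back out how many tokens a performance-matched router must have sent to the large model:

```python
def implied_router_share(relay_ratio: float, reduction: float) -> float:
    """LLM-token share a performance-matched router must have used for
    RelayLLM's relay_ratio to amount to a `reduction` cost cut."""
    return relay_ratio / (1.0 - reduction)

# 1.07% of tokens from the LLM, 98.2% cost reduction (both from the summary)
share = implied_router_share(0.0107, 0.982)
print(f"implied router LLM share: {share:.1%}")  # roughly 59%
```

That is, the numbers are mutually consistent if the matched router routed roughly six in ten tokens through the large model.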
How to Apply
- Deploy the small model as the primary inference engine and expose the large model as an API; implement switching with vLLM stop sequences so that when the small model generates a <call>N</call> token, generation of the next N tokens is delegated to the large model
- Use the same model family (e.g., Qwen3-0.6B student + Qwen3-8B teacher) to ensure tokenizer and vocabulary distribution alignment for stable collaboration — replacing with a larger external model may cause distribution mismatch and performance degradation
- Training data filtering is essential: remove problems that the large model fails to solve more than 5 out of 10 times → skipping this step increases the call ratio by 3x and actually lowers accuracy (48.76%)
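The filtering step above can be sketched as a simple predicate over sampled teacher attempts. `llm_solves` is a hypothetical callable (returning whether one sampled attempt was correct), not an API from the paper:

```python
def keep_problem(llm_solves, problem, n_trials: int = 10,
                 max_fails: int = 5) -> bool:
    """Keep `problem` only if the teacher fails at most `max_fails` of `n_trials`
    sampled attempts; drop everything the large model itself cannot solve reliably."""
    fails = sum(not llm_solves(problem) for _ in range(n_trials))
    return fails <= max_fails

# Usage: train_set = [p for p in raw_problems if keep_problem(teacher_solves, p)]
```

Without this filter, unsolvable problems dominate the exploration reward and the policy learns to over-call the teacher, which matches the 3x call-ratio blowup reported above.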
Code Example
# Example implementation of RelayLLM switching with vLLM
import re

from vllm import LLM, SamplingParams

slm = LLM(model="Qwen/Qwen3-1.7B")
llm = LLM(model="Qwen/Qwen3-8B")

def relay_generate(prompt: str, max_rounds: int = 10) -> str:
    context = prompt
    for _ in range(max_rounds):
        # Small model generates until it emits a call command (or finishes)
        params = SamplingParams(
            max_tokens=512,
            stop=["</call>"],
            include_stop_str_in_output=True,
        )
        out = slm.generate([context], params)[0].outputs[0].text
        if "<call>" not in out:
            return context + out  # no help requested: done
        # Parse the requested token budget from <call>N</call>
        match = re.search(r"<call>\s*(\d+)\s*</call>", out)
        n_tokens = int(match.group(1)) if match else 50
        # Strip call tokens before handing the context to the large model
        clean_context = re.sub(r"<call>.*?</call>", "", context + out)
        llm_params = SamplingParams(max_tokens=n_tokens)
        llm_out = llm.generate([clean_context], llm_params)[0].outputs[0].text
        # Update context (the SLM keeps the call tokens in its own history)
        context = context + out + llm_out
    return context

Terminology
Related Resources
Original Abstract
The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.