RelayLLM: Efficient Reasoning via Collaborative Decoding
TL;DR Highlight
A collaborative inference framework in which a small model calls a large model only for difficult tokens, maintaining accuracy while cutting token costs by 98.2% relative to performance-matched routers.
Who Should Read
ML engineers who find LLM inference costs burdensome and are weighing smaller models or routing strategies as replacements, especially teams that need to optimize the cost-performance tradeoff on math/reasoning tasks.
Core Mechanics
- Traditional routers use an 'all-or-nothing' approach of passing the entire query to a large model → RelayLLM generates a <call> command at the token level and invokes the large model only when needed
- Qwen3-1.7B (small) uses Qwen3-8B (large) as a teacher, with only 1.07% of total tokens generated by the large model → the remaining 98.93% are handled directly by the small model
- Two-stage training with GRPO (a reinforcement learning technique): ① cold start (learning command syntax) → ② difficulty-aware rewards to optimize 'when to ask for help'
- Three reward designs by difficulty level: independence bonus if solvable alone (r=1.5), penalty if unsolvable without the teacher (r=-1.0), exploration reward if neither can solve it (r=ρ)
- Trained only on math data but successfully generalized to other domains such as MMLU-Pro and Big-Bench Hard (MMLU-Pro: GRPO 49.76% → RelayLLM 59.03%)
- In Teacher-Free evaluation (calls completely blocked), performance on easy benchmarks exceeded the GRPO baseline → collaborative training also improves the small model's intrinsic reasoning ability
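The difficulty-aware rewards above can be sketched as a single scoring function. This is one plausible reading of the bullet: the values 1.5 and -1.0 come from the summary, while the medium-difficulty reward of 1.0, the treatment of teacher calls, and the default ρ = 0.5 are illustrative assumptions.

```python
def relay_reward(correct: bool, used_teacher: bool, difficulty: str,
                 rho: float = 0.5) -> float:
    """Difficulty-aware reward for one rollout (sketch, not the paper's exact API)."""
    if difficulty == "easy":
        # SLM can solve this alone: independence bonus for solving without help,
        # a smaller reward if it called the teacher anyway, penalty on failure
        if correct:
            return 1.5 if not used_teacher else 1.0
        return -1.0
    if difficulty == "medium":
        # Solvable only with the teacher: failing (e.g., by never calling) is penalized
        return 1.0 if correct else -1.0
    # "hard": neither model solves it reliably, so reward exploration with rho
    return rho
```

In GRPO training, such a reward would be computed per sampled rollout, with the difficulty bucket estimated from pass rates of the small and large models on each problem.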
Evidence
- Average accuracy across 6 math benchmarks: small model alone 42.5% → RelayLLM 49.52%, recovering 60% of the gap to the large model (Qwen3-8B) at 54.12%
- 6.9% accuracy improvement over a Random Router at equivalent cost; 98.2% token cost reduction compared to a router of equivalent performance
- On the Minerva benchmark with Qwen3-0.6B: 15.81% → 23.53% (48.8% relative improvement) with a large model call ratio of 0.77%
- On MMLU-Pro with Qwen3-1.7B: GRPO 49.76% vs CITER 53.38% vs RelayLLM 59.03%, confirming out-of-domain generalization advantage
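The 98.2% figure can be sanity-checked with back-of-envelope arithmetic. Assuming cost is proportional to the number of LLM-generated tokens (an assumption, not stated in the summary), we can back out how many tokens a performance-matched router must have sent to the large model:

```python
def implied_router_share(relay_ratio: float, reduction: float) -> float:
    """LLM-token share a performance-matched router must have used for
    RelayLLM's relay_ratio to amount to a `reduction` cost cut."""
    return relay_ratio / (1.0 - reduction)

# 1.07% of tokens from the LLM, 98.2% cost reduction (both from the summary)
share = implied_router_share(0.0107, 0.982)
print(f"implied router LLM share: {share:.1%}")  # roughly 59%
```

That is, the numbers are mutually consistent if the matched router routed roughly six in ten tokens through the large model.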
How to Apply
- Deploy the small model as the primary inference engine and expose the large model as an API; implement switching with vLLM stop sequences so that when the small model generates a <call>N</call> token, generation of the next N tokens is delegated to the large model
- Use the same model family (e.g., Qwen3-0.6B student + Qwen3-8B teacher) to ensure tokenizer and vocabulary distribution alignment for stable collaboration — replacing with a larger external model may cause distribution mismatch and performance degradation
- Training data filtering is essential: remove problems that the large model fails to solve more than 5 out of 10 times → skipping this step increases the call ratio by 3x and actually lowers accuracy (48.76%)
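The filtering step above can be sketched as a simple predicate over sampled teacher attempts. `llm_solves` is a hypothetical callable (returning whether one sampled attempt was correct), not an API from the paper:

```python
def keep_problem(llm_solves, problem, n_trials: int = 10,
                 max_fails: int = 5) -> bool:
    """Keep `problem` only if the teacher fails at most `max_fails` of `n_trials`
    sampled attempts; drop everything the large model itself cannot solve reliably."""
    fails = sum(not llm_solves(problem) for _ in range(n_trials))
    return fails <= max_fails

# Usage: train_set = [p for p in raw_problems if keep_problem(teacher_solves, p)]
```

Without this filter, unsolvable problems dominate the exploration reward and the policy learns to over-call the teacher, which matches the 3x call-ratio blowup reported above.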
Code Example
# Example implementation of RelayLLM switching with vLLM
import re

from vllm import LLM, SamplingParams

slm = LLM(model="Qwen/Qwen3-1.7B")
llm = LLM(model="Qwen/Qwen3-8B")

def relay_generate(prompt: str, max_rounds: int = 10) -> str:
    context = prompt
    for _ in range(max_rounds):
        # Small model generates until it emits a call command (or finishes)
        params = SamplingParams(
            max_tokens=512,
            stop=["</call>"],
            include_stop_str_in_output=True,
        )
        out = slm.generate([context], params)[0].outputs[0].text
        if "<call>" not in out:
            return context + out  # no help requested: done
        # Parse the requested token budget from <call>N</call>
        match = re.search(r"<call>\s*(\d+)\s*</call>", out)
        n_tokens = int(match.group(1)) if match else 50
        # Strip call tokens before handing the context to the large model
        clean_context = re.sub(r"<call>.*?</call>", "", context + out)
        llm_params = SamplingParams(max_tokens=n_tokens)
        llm_out = llm.generate([clean_context], llm_params)[0].outputs[0].text
        # Update context (the SLM keeps the call tokens in its own history)
        context = context + out + llm_out
    return context

Terminology
Related Resources
Original Abstract
The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.