RelayLLM: Cutting Inference Cost by 98% with Collaborative Decoding Between Small and Large Models

RelayLLM: Efficient Reasoning via Collaborative Decoding

Jan 8, 2026 • Chengsong Huang, Tong Zheng, Langlin Huang +3

TL;DR Highlight

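A collaborative inference framework in which a small model calls a large model only on difficult tokens, recovering much of the large model's accuracy while cutting token cost by about 98% relative to a performance-matched router.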

Who Should Read

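ML engineers for whom LLM inference cost is a burden and who are weighing a switch to small models or a routing strategy. Especially relevant to teams that must optimize the cost-performance trade-off on math and reasoning tasks.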

Core Mechanics

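Existing routers work all-or-nothing, handing the entire query to the large model. RelayLLM instead has the small model emit a <call> command at the token level, so the large model is invoked only at the moments it is needed.
Qwen3-1.7B (small) uses Qwen3-8B (large) as its teacher: only 1.07% of all generated tokens come from the large model, and the small model produces the remaining 98.93% itself.
Two-stage training: (1) a cold start that teaches the command syntax, then (2) GRPO (a reinforcement learning method) with a difficulty-aware reward that optimizes when to ask for help.
The reward has three difficulty-dependent cases: an independence bonus (r = 1.5) when the small model can solve the problem on its own, a penalty (r = -1.0) when it fails on a problem it cannot solve without the teacher, and an exploration reward (r = ρ) when neither model can solve it.
Training used only math data, yet the model generalizes to other domains such as MMLU-Pro and Big-Bench Hard (MMLU-Pro: GRPO 49.76% → RelayLLM 59.03%).
In the Teacher-Free evaluation (calls fully blocked), RelayLLM still beats the GRPO baseline on the easier benchmarks, which suggests that collaborative training also improves the small model's own reasoning ability.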
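
The three reward cases above can be sketched as a per-group reward function. This is only an illustration under stated assumptions: the summary gives the values 1.5, -1.0, and ρ, but the way a problem is classified (here, from the outcomes of the rollouts sampled for it), the baseline rewards of 1.0 and 0.0, and the default `rho=0.5` are hypothetical choices, not details confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool         # final answer was verified as correct
    called_teacher: bool  # at least one <call> command was issued


def difficulty_aware_rewards(group: list[Rollout], rho: float = 0.5) -> list[float]:
    """Assign one reward per rollout sampled for the same problem.

    Assumption: difficulty is inferred from the group itself.
    """
    solved_alone = any(r.correct and not r.called_teacher for r in group)
    solved_at_all = any(r.correct for r in group)

    rewards = []
    for r in group:
        if solved_alone:
            # Student-solvable: bonus for a correct answer without help.
            if r.correct:
                rewards.append(1.0 if r.called_teacher else 1.5)
            else:
                rewards.append(0.0)
        elif solved_at_all:
            # Teacher-dependent: only rollouts that called the teacher succeeded.
            if r.correct:
                rewards.append(1.0)
            elif not r.called_teacher:
                rewards.append(-1.0)  # failed and never asked for help
            else:
                rewards.append(0.0)
        else:
            # Nobody solved it: small exploration reward for asking anyway.
            rewards.append(rho if r.called_teacher else 0.0)
    return rewards
```

With this shaping, a rollout that solves an easy problem alone outranks one that solves it with help, which is the pressure that keeps the call ratio low.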

Evidence

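Average accuracy over six math benchmarks: 42.5% for the small model alone versus 49.52% for RelayLLM, which recovers about 60% of the gap to the large model (Qwen3-8B, 54.12%).
6.9% higher accuracy than a Random Router at the same cost, and 98.2% lower token cost than a router matched on performance.
On the Minerva benchmark with Qwen3-0.6B: 15.81% → 23.53% (a 48.8% relative improvement) at a large-model call ratio of 0.77%.
On MMLU-Pro with Qwen3-1.7B: GRPO 49.76% vs. CITER 53.38% vs. RelayLLM 59.03%, confirming the advantage in out-of-domain generalization.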
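
The "60% of the gap" and "48.8% relative" figures follow directly from the reported accuracies. The short check below reproduces the arithmetic; the function names are illustrative and not taken from the paper.

```python
def gap_recovered(small: float, relay: float, large: float) -> float:
    """Fraction of the small-to-large accuracy gap closed by the relay model."""
    return (relay - small) / (large - small)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


# Six-benchmark average: 42.5 (small), 49.52 (RelayLLM), 54.12 (large)
print(f"{gap_recovered(42.5, 49.52, 54.12):.1%}")  # 60.4%

# Minerva with Qwen3-0.6B: 15.81 -> 23.53
print(f"{relative_gain(15.81, 23.53):.1%}")        # 48.8%
```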

How to Apply

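Run the small model as the main inference engine and serve the large model behind an API. When the small model emits <call>N</call>, delegate the generation of N tokens to the large model, implemented as stop-sequence-based switching (using vLLM stop sequences).
Pair models from the same family (for example, a Qwen3-0.6B student with a Qwen3-8B teacher) so that the tokenizer and vocabulary distribution match and collaboration stays stable. Swapping in a larger external model can actually lower performance because of distribution mismatch.
Filtering the training data is essential: remove problems that even the large model fails at least 5 times out of 10. Skipping this step triples the call ratio and lowers accuracy (48.76%).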
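
The filtering rule in the last item could look like the sketch below. It assumes you have already sampled the large model 10 times per problem and recorded each attempt as a boolean; the data layout, the function names, and the `max_failures` parameter are hypothetical.

```python
def keep_problem(teacher_correct: list[bool], max_failures: int = 4) -> bool:
    """Keep a problem only if the teacher fails fewer than 5 of its attempts."""
    failures = sum(1 for ok in teacher_correct if not ok)
    return failures <= max_failures


def filter_dataset(problems: dict[str, list[bool]]) -> list[str]:
    """Return the ids of problems that survive the teacher-solvability filter."""
    return [pid for pid, results in problems.items() if keep_problem(results)]


attempts = {
    "p1": [True] * 10,                # always solved: keep
    "p2": [True] * 5 + [False] * 5,   # 5 failures: drop
    "p3": [True] * 6 + [False] * 4,   # 4 failures: keep
}
print(filter_dataset(attempts))  # ['p1', 'p3']
```

The intuition is that problems the teacher cannot solve reliably reward calls that do not help, which is consistent with the reported rise in call ratio when the filter is skipped.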

Code Example

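# Example RelayLLM switching loop built on vLLM
import re

from vllm import LLM, SamplingParams

# Both engines share one process here. The memory fractions are illustrative
# assumptions so that two models fit on one GPU; tune them for your hardware.
slm = LLM(model="Qwen/Qwen3-1.7B", gpu_memory_utilization=0.3)
llm = LLM(model="Qwen/Qwen3-8B", gpu_memory_utilization=0.5)

CALL_PATTERN = re.compile(r"<call>\s*(\d+)\s*</call>")
DEFAULT_CALL_TOKENS = 50  # fallback when the call command cannot be parsed

def relay_generate(prompt: str, max_rounds: int = 10) -> str:
    context = prompt
    # The small model stops as soon as it closes a call command
    slm_params = SamplingParams(
        max_tokens=512,
        stop=["</call>"],
        include_stop_str_in_output=True,
    )
    for _ in range(max_rounds):
        out = slm.generate([context], slm_params)[0].outputs[0].text

        if "<call>" not in out:
            return context + out  # finished without asking for help

        # Parse the requested token count (at least 1, since vLLM rejects 0)
        match = CALL_PATTERN.search(out)
        n_tokens = max(1, int(match.group(1))) if match else DEFAULT_CALL_TOKENS

        # Strip call commands before handing the context to the large model
        clean_context = re.sub(r"<call>.*?</call>", "", context + out, flags=re.DOTALL)
        llm_params = SamplingParams(max_tokens=n_tokens)
        llm_out = llm.generate([clean_context], llm_params)[0].outputs[0].text

        # Update the context (the small model keeps the call commands in its history)
        context = context + out + llm_out

    # Round budget exhausted: return what has been generated so far
    return context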
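
To check the switching logic without loading any model, the same control flow can be exercised with stand-in callables. This is a dry-run sketch, not part of the RelayLLM code base: `relay`, `parse_call`, `fake_small`, and `fake_large` are hypothetical names, and the two fake models simply return canned strings.

```python
import re
from typing import Callable, Optional

CALL_RE = re.compile(r"<call>\s*(\d+)\s*</call>")


def parse_call(chunk: str) -> Optional[int]:
    """Return N if the chunk contains <call>N</call>, otherwise None."""
    match = CALL_RE.search(chunk)
    return int(match.group(1)) if match else None


def relay(prompt: str,
          small: Callable[[str], str],
          large: Callable[[str, int], str],
          max_rounds: int = 10) -> str:
    context = prompt
    for _ in range(max_rounds):
        chunk = small(context)
        n_tokens = parse_call(chunk)
        if n_tokens is None:
            return context + chunk  # small model finished on its own
        # The teacher sees the history with call commands stripped.
        teacher_view = CALL_RE.sub("", context + chunk)
        context = context + chunk + large(teacher_view, n_tokens)
    return context


def fake_small(context: str) -> str:
    # Asks for help once, then wraps up.
    return "2+2=<call>1</call>" if "<call>" not in context else " Done."


def fake_large(context: str, n_tokens: int) -> str:
    assert "<call>" not in context  # commands must be stripped for the teacher
    return "4"


print(relay("Q: 2+2? A: ", fake_small, fake_large))
# Q: 2+2? A: 2+2=<call>1</call>4 Done.
```

Keeping the loop independent of the model backend makes it easy to unit-test parsing and context handling before wiring in real engines.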

Terminology

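SLM: Small Language Model. A small, fast language model on the order of 1-2B parameters. It can run on a smartphone or a low-cost server but struggles with complex reasoning.

GRPO: Group Relative Policy Optimization. A reinforcement learning method for improving a model. It samples several answers to the same problem, rewards those that beat the group average, and penalizes those that fall below it, so the model gradually learns better strategies.

RLVR: Reinforcement Learning with Verifiable Reward. Reinforcement learning for tasks whose answers can be checked unambiguously (such as math), where the reward is computed automatically by rules. No human has to grade each answer.

Collaborative Decoding: A generation scheme in which two or more models take turns producing text. One model writes, hands off the difficult parts to another, then takes over again and continues.

Call Ratio: The share of all generated tokens that the large model produced. A value of 1% means the small model handled the other 99%. The lower the value, the better the cost efficiency.

Cold Start: A stage before reinforcement learning that first teaches the basic behavior through supervised learning. Starting RL from a completely blank state leads to erratic behavior, so a warm-up is needed.

vLLM: An open-source engine for high-speed LLM inference serving. Optimizations such as PagedAttention use GPU memory efficiently and raise throughput substantially.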
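
Two of the terms above reduce to short formulas. The sketch below computes the call ratio and the group-relative advantage that GRPO uses (each reward minus the group mean, divided by the group standard deviation). The function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this illustration.

```python
from statistics import mean, pstdev


def call_ratio(large_tokens: int, total_tokens: int) -> float:
    """Share of generated tokens that came from the large model."""
    return large_tokens / total_tokens if total_tokens else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of answers to the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every answer gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


print(f"{call_ratio(107, 10_000):.2%}")  # 1.07%
print(group_relative_advantages([1.5, 0.0, 0.0, -1.0]))
```

Answers above the group mean receive a positive advantage and are reinforced, while those below it are discouraged. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.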

Related Resources

https://github.com/Chengsong-Huang/RelayLLM

Original Abstract

Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.