Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
TL;DR Highlight
A survey of Speculative Decoding techniques, in which a small draft model pre-predicts tokens and a large model verifies them in parallel, yielding 2-3x LLM inference speedups.
Who Should Read
Backend/ML engineers wanting to reduce LLM API call costs or response latency. Developers running inference servers such as vLLM or TGI, or optimizing on-device LLMs.
Core Mechanics
- Core Speculative Decoding idea: a small draft model pre-generates K tokens, and the large target LLM verifies them in a single parallel forward pass, cutting the number of memory-bound forward passes per generated token and hence the latency
- Two drafter types: (1) external small model (e.g., T5-small to accelerate T5-XXL), (2) self-drafting using the target LLM itself (adding heads like Medusa, EAGLE)
- EAGLE tops Spec-Bench overall: average 2.08x speedup with Vicuna-7B on an RTX 3090, up to 2.5x on an A100. Key: it reuses the KV cache to keep drafting overhead low
- Token Tree Verification: groups multiple candidate sequences into a tree that the LLM verifies in one pass → more accepted tokens per step than single-sequence verification (used by SpecInfer, Medusa, EAGLE)
- Speedup decreases as sampling temperature rises: EAGLE's 2.08x at T=0 drops to 1.74x at T=1. Higher temperature makes the draft and target distributions agree less often, lowering the token acceptance rate
- Watch for FP16 vs FP32: Speculative Decoding results can subtly differ from regular autoregressive decoding in FP16 due to floating-point error accumulation in long sequences
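The draft-then-verify loop from the bullets above can be sketched with toy stand-in models. A minimal sketch, in which `draft_next` and `target_next` are hypothetical next-token rules over integer tokens (not real LLMs), and verification is written as a loop for clarity even though a real system batches all K positions into one target forward pass:

```python
def draft_next(ctx):
    # hypothetical cheap drafter: a fixed next-token rule over integer tokens
    return (ctx[-1] * 2) % 7

def target_next(ctx):
    # hypothetical target: agrees with the drafter except when the last token % 3 == 0
    return (ctx[-1] * 2) % 7 if ctx[-1] % 3 else (ctx[-1] + 1) % 7

def speculative_step(ctx, k=4):
    """One draft-then-verify step (greedy, T=0): draft k tokens, keep the
    longest prefix the target agrees with, and always emit at least one
    target token (the corrected token on mismatch, or a bonus token)."""
    # 1) drafting: k cheap sequential calls to the small model
    draft, cur = [], list(ctx)
    for _ in range(k):
        draft.append(draft_next(cur))
        cur.append(draft[-1])
    # 2) verification: in a real system these k calls are a single
    #    batched forward pass of the target model
    accepted, cur = [], list(ctx)
    for t in draft:
        expect = target_next(cur)
        if expect != t:
            accepted.append(expect)   # first mismatch: take the target's token, stop
            return accepted
        accepted.append(t)
        cur.append(t)
    accepted.append(target_next(cur))  # all k accepted: one extra token for free
    return accepted
```

With greedy verification the output is token-for-token identical to running the target alone; the win is that accepted tokens cost one parallel pass instead of k sequential ones.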
Evidence
- EAGLE: RTX 3090 + Vicuna-7B average 2.08x speedup (math reasoning 2.44x, multi-turn dialogue 2.35x), up to 2.53x on A100 (Vicuna-13B)
- GPU comparison (RTX 3090 → A100): Medusa 1.48x → 2.42x (+64%), Lookahead 1.11x → 1.77x (+59%). More powerful GPUs amplify the Speculative Decoding effect
- PLD (Prompt Lookup Decoding): 2.41x on summarization but drops to 1.11x on translation. Specialized for tasks with high input-output text similarity
- Original SpecDec paper reports 5x speedup (with Transformer-base 65M model). Realistically, 2-3x is typical for general tasks
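The measured numbers above can be sanity-checked against the expected-tokens analysis from the original speculative sampling work. A minimal sketch, assuming each drafted token is accepted independently with probability `alpha` (an i.i.d. simplification) and a drafter whose per-call cost is an assumed fraction `c` of a target forward pass:

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target forward pass, when each of the
    gamma drafted tokens is accepted i.i.d. with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def overall_speedup(alpha, gamma, c):
    """End-to-end speedup when one drafter call costs c target calls
    (c is an assumed latency ratio, e.g. 0.05 for a very small drafter)."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)

# e.g. alpha=0.8, gamma=4, c=0.05 -> about 3.36 tokens per target pass
# and roughly a 2.8x speedup, in line with the 2-3x figures above
```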
How to Apply
- To attach EAGLE or Medusa to existing LLM serving: train an additional autoregressive head on the target model. 2x+ speedup is possible with a single model and no separate small model to manage. See GitHub SafeAILab/EAGLE
- For quick no-training adoption: use HuggingFace's assisted generation (SpS) feature. Set a smaller model from the same series (e.g., vicuna-68m) as the drafter for 1.5-2x acceleration without additional training
- For RAG pipelines or tasks where input and output are similar: PLD (Prompt Lookup Decoding) is simple and effective. Reuses text spans from the input prompt as drafts — very easy to implement
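The PLD idea in the last bullet is easy to see in code. A minimal sketch of the draft step (function name and parameters are illustrative, not the HuggingFace API; verification proceeds exactly as in ordinary speculative decoding):

```python
def prompt_lookup_draft(prompt_ids, generated_ids, ngram=2, num_draft=10):
    """Sketch of PLD's draft step: find the most recent occurrence of the
    last `ngram` generated tokens inside the prompt and propose the tokens
    that follow it as the draft."""
    key = generated_ids[-ngram:]
    if not key:
        return []
    # scan right-to-left so the latest occurrence in the prompt wins
    for start in range(len(prompt_ids) - len(key), -1, -1):
        if prompt_ids[start:start + len(key)] == key:
            cont = prompt_ids[start + len(key):start + len(key) + num_draft]
            if cont:
                return cont  # draft = the span that followed the match
    return []  # no match: fall back to ordinary one-token decoding
```

This is why PLD shines on summarization and RAG (the output largely copies the input) and does little for translation, where such overlaps are rare.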
Code Example
# HuggingFace Assisted Generation (SpS) - Ready to use without additional training
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Target LLM (large model), loaded in fp16 so it fits comfortably on one GPU
target_model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)
# Draft model (small model from the same series, so the tokenizers match)
draft_model = AutoModelForCausalLM.from_pretrained(
    "double7/vicuna-68m", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3")
inputs = tokenizer("Tell me about speculative decoding", return_tensors="pt")
# Generate with Speculative Decoding (assistant_model parameter is the key)
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # that's all it takes
    max_new_tokens=200,
    do_sample=False,  # greedy decoding (T=0)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# PLD (Prompt Lookup Decoding) - Effective for RAG/summarization tasks
# pip install transformers>=4.37 (prompt lookup decoding support)
outputs = target_model.generate(
    **inputs,
    prompt_lookup_num_tokens=10,  # reuse 10 tokens at a time from the input as the draft
    max_new_tokens=200,
)
Original Abstract
To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.