Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
TL;DR Highlight
A survey of Speculative Decoding techniques, in which a small draft model pre-predicts tokens and a large model verifies them in parallel, yielding 2-3x LLM inference speedups.
Who Should Read
Backend/ML engineers wanting to reduce LLM API call costs or response latency. Developers running inference servers such as vLLM or TGI, or optimizing on-device LLMs.
Core Mechanics
- Core Speculative Decoding idea: a small draft model pre-generates K tokens, and the large target LLM verifies them in a single parallel forward pass, cutting the number of memory-bound forward passes per generated token and hence the latency
- Two drafter types: (1) external small model (e.g., T5-small to accelerate T5-XXL), (2) self-drafting using the target LLM itself (adding heads like Medusa, EAGLE)
- EAGLE tops Spec-Bench overall: average 2.08x speedup with Vicuna-7B on an RTX 3090, up to 2.5x on an A100. Key: it reuses the KV cache to keep drafting overhead low
- Token Tree Verification: groups multiple candidate sequences into a tree that the LLM verifies in one pass → more accepted tokens per step than single-sequence verification (used by SpecInfer, Medusa, EAGLE)
- Speedup decreases as sampling temperature rises: EAGLE's 2.08x at T=0 drops to 1.74x at T=1. Higher temperature makes the draft and target distributions agree less often, lowering the token acceptance rate
- Watch for FP16 vs FP32: Speculative Decoding results can subtly differ from regular autoregressive decoding in FP16 due to floating-point error accumulation in long sequences
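The draft-then-verify loop from the bullets above can be sketched with toy stand-in models. A minimal sketch, in which `draft_next` and `target_next` are hypothetical next-token rules over integer tokens (not real LLMs), and verification is written as a loop for clarity even though a real system batches all K positions into one target forward pass:

```python
def draft_next(ctx):
    # hypothetical cheap drafter: a fixed next-token rule over integer tokens
    return (ctx[-1] * 2) % 7

def target_next(ctx):
    # hypothetical target: agrees with the drafter except when the last token % 3 == 0
    return (ctx[-1] * 2) % 7 if ctx[-1] % 3 else (ctx[-1] + 1) % 7

def speculative_step(ctx, k=4):
    """One draft-then-verify step (greedy, T=0): draft k tokens, keep the
    longest prefix the target agrees with, and always emit at least one
    target token (the corrected token on mismatch, or a bonus token)."""
    # 1) drafting: k cheap sequential calls to the small model
    draft, cur = [], list(ctx)
    for _ in range(k):
        draft.append(draft_next(cur))
        cur.append(draft[-1])
    # 2) verification: in a real system these k calls are a single
    #    batched forward pass of the target model
    accepted, cur = [], list(ctx)
    for t in draft:
        expect = target_next(cur)
        if expect != t:
            accepted.append(expect)   # first mismatch: take the target's token, stop
            return accepted
        accepted.append(t)
        cur.append(t)
    accepted.append(target_next(cur))  # all k accepted: one extra token for free
    return accepted
```

With greedy verification the output is token-for-token identical to running the target alone; the win is that accepted tokens cost one parallel pass instead of k sequential ones.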
Evidence
- EAGLE: RTX 3090 + Vicuna-7B average 2.08x speedup (math reasoning 2.44x, multi-turn dialogue 2.35x), up to 2.53x on A100 (Vicuna-13B)
- GPU comparison (RTX 3090 → A100): Medusa 1.48x → 2.42x (+64%), Lookahead 1.11x → 1.77x (+59%). More powerful GPUs amplify the Speculative Decoding effect
- PLD (Prompt Lookup Decoding): 2.41x on summarization but drops to 1.11x on translation. Specialized for tasks with high input-output text similarity
- Original SpecDec paper reports 5x speedup (with Transformer-base 65M model). Realistically, 2-3x is typical for general tasks
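The measured numbers above can be sanity-checked against the expected-tokens analysis from the original speculative sampling work. A minimal sketch, assuming each drafted token is accepted independently with probability `alpha` (an i.i.d. simplification) and a drafter whose per-call cost is an assumed fraction `c` of a target forward pass:

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target forward pass, when each of the
    gamma drafted tokens is accepted i.i.d. with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def overall_speedup(alpha, gamma, c):
    """End-to-end speedup when one drafter call costs c target calls
    (c is an assumed latency ratio, e.g. 0.05 for a very small drafter)."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)

# e.g. alpha=0.8, gamma=4, c=0.05 -> about 3.36 tokens per target pass
# and roughly a 2.8x speedup, in line with the 2-3x figures above
```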
How to Apply
- To attach EAGLE or Medusa to existing LLM serving: train an additional autoregressive head on the target model. 2x+ speedup is possible with a single model and no separate small model to manage. See GitHub SafeAILab/EAGLE
- For quick no-training adoption: use HuggingFace's assisted generation (SpS) feature. Set a smaller model from the same series (e.g., vicuna-68m) as the drafter for 1.5-2x acceleration without additional training
- For RAG pipelines or tasks where input and output are similar: PLD (Prompt Lookup Decoding) is simple and effective. Reuses text spans from the input prompt as drafts — very easy to implement
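The PLD idea in the last bullet is easy to see in code. A minimal sketch of the draft step (function name and parameters are illustrative, not the HuggingFace API; verification proceeds exactly as in ordinary speculative decoding):

```python
def prompt_lookup_draft(prompt_ids, generated_ids, ngram=2, num_draft=10):
    """Sketch of PLD's draft step: find the most recent occurrence of the
    last `ngram` generated tokens inside the prompt and propose the tokens
    that follow it as the draft."""
    key = generated_ids[-ngram:]
    if not key:
        return []
    # scan right-to-left so the latest occurrence in the prompt wins
    for start in range(len(prompt_ids) - len(key), -1, -1):
        if prompt_ids[start:start + len(key)] == key:
            cont = prompt_ids[start + len(key):start + len(key) + num_draft]
            if cont:
                return cont  # draft = the span that followed the match
    return []  # no match: fall back to ordinary one-token decoding
```

This is why PLD shines on summarization and RAG (the output largely copies the input) and does little for translation, where such overlaps are rare.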
Code Example
# HuggingFace Assisted Generation (SpS) - Ready to use without additional training
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Target LLM (large model), loaded in fp16 so it fits comfortably on one GPU
target_model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16
)
# Draft model (small model from the same series, so the tokenizers match)
draft_model = AutoModelForCausalLM.from_pretrained(
    "double7/vicuna-68m", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3")
inputs = tokenizer("Tell me about speculative decoding", return_tensors="pt")
# Generate with Speculative Decoding (assistant_model parameter is the key)
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # that's all it takes
    max_new_tokens=200,
    do_sample=False,  # greedy decoding (T=0)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# PLD (Prompt Lookup Decoding) - Effective for RAG/summarization tasks
# pip install transformers>=4.37 (prompt lookup decoding support)
outputs = target_model.generate(
    **inputs,
    prompt_lookup_num_tokens=10,  # reuse 10 tokens at a time from the input as the draft
    max_new_tokens=200,
)
Original Abstract
To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.