EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
TL;DR Highlight
A speculative decoding technique that redesigns the draft model architecture to boost LLM inference speed by up to 6.5x.
Who Should Read
ML engineers and backend developers who want to reduce LLM serving costs and latency — especially those optimizing inference speed for reasoning models like DeepSeek-R1 in production.
Core Mechanics
- Identifies EAGLE-2's 'feature prediction constraint' as the bottleneck limiting gains from more training data — EAGLE-3 removes this constraint and predicts tokens directly
- Training-time test: during training, the draft model's own output is fed back as the next-step input, simulating inference conditions and eliminating the train/test distribution mismatch
- Instead of relying only on the target model's top-layer features, fuses low-, mid-, and high-layer features (concatenation + an FC layer) for richer information
- Exhibits a data scaling law: speedup keeps rising as training data grows up to 8x — the first such scaling curve observed in the EAGLE family
- Llama 3.1 8B Instruct: ~1.4x additional acceleration over EAGLE-2, up to 6.5x total speedup (HumanEval code generation task)
- 38% throughput improvement at batch size 64 in the SGLang framework — earlier EAGLE versions degraded beyond batch size 24
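The second and third bullets can be sketched in a few lines of NumPy — an illustrative toy, not the authors' implementation; all shapes, weights, and the tanh draft step are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, seq = 4096, 8  # target-model width and sequence length (assumed)

# Low-, mid-, and high-layer hidden states from the target model
h_low, h_mid, h_high = (rng.standard_normal((seq, hidden)) for _ in range(3))

# Feature fusion: concatenate the three layers, then one FC projection
W_fc = rng.standard_normal((3 * hidden, hidden)) * 0.02
fused = np.concatenate([h_low, h_mid, h_high], axis=-1) @ W_fc

# "Training-time test": the draft consumes its own previous output rather
# than ground-truth features, mirroring what happens at inference time
W_draft = rng.standard_normal((hidden, hidden)) * 0.02
state = fused[-1]                  # start from the last fused position
for _ in range(3):                 # three self-fed draft steps
    state = np.tanh(state @ W_draft)

print(fused.shape, state.shape)    # (8, 4096) (4096,)
```

The key point is the loop: because the draft model trains on its own (imperfect) outputs, it never sees a distribution at inference that it was not exposed to during training.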
Evidence
- EAGLE-3: Vicuna 13B reaches 5.58x on MT-bench and 6.47x on HumanEval — a 1.31x and 1.30x improvement over EAGLE-2, respectively
- SGLang + H100, batch size 1: SGLang baseline 158 tokens/s vs EAGLE-2 244 tokens/s vs EAGLE-3 373 tokens/s
- Llama 3.1 8B Instruct ablation: removing the feature prediction constraint alone lifts speedup from 3.16x to 3.82x; adding multi-layer feature fusion reaches 4.40x
- 8x data scaling: EAGLE-3's acceptance length grows to 6.0+, while EAGLE-2's stays near 4.0 and barely changes
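As a quick sanity check, the tokens-per-second figures quoted above translate into the following speedup ratios (simple arithmetic on the quoted numbers):

```python
# Tokens/s at batch size 1 on H100, from the Evidence section above
base, eagle2, eagle3 = 158, 244, 373

print(f"EAGLE-2 vs base:    {eagle2 / base:.2f}x")    # 1.54x
print(f"EAGLE-3 vs base:    {eagle3 / base:.2f}x")    # 2.36x
print(f"EAGLE-3 vs EAGLE-2: {eagle3 / eagle2:.2f}x")  # 1.53x
```

The ~1.5x per-request gain over EAGLE-2 at batch size 1 is roughly consistent with the paper's ~1.4x average improvement across tasks.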
How to Apply
- If you use vLLM or SGLang, attach an EAGLE-3 draft model for a speedup with no further code changes — published weights and code are available on GitHub.
- For latency-sensitive serving of reasoning models like DeepSeek-R1: mix math-specialized data into EAGLE-3 draft-model training for domain-specific acceleration.
- For batch serving environments (batch sizes 16-64) that need higher throughput: consider EAGLE-3 — earlier speculative decoding loses its advantage at large batch sizes, while EAGLE-3 stays effective up to batch size 64.
Code Example
# Example of using EAGLE-3 with SGLang
# 1. Clone the repo
# git clone https://github.com/SafeAILab/EAGLE
# 2. Specify the EAGLE-3 draft model when launching the SGLang server
#    (flag names below follow recent SGLang releases; check your version's docs)
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path [EAGLE-3 draft model path] \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 16

# 3. Using with vLLM (argument names vary across vLLM versions;
#    newer releases configure speculative decoding via speculative_config)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[EAGLE-3 draft model path]",
    num_speculative_tokens=3,
    use_v2_block_manager=True,
)
sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(["Tell me about speculative decoding"], sampling_params)
Original Abstract
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.