EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
TL;DR Highlight
A speculative decoding technique that redesigns the draft model architecture to boost LLM inference speed by up to 6.5x.
Who Should Read
ML engineers and backend developers who want to reduce LLM serving costs and latency — especially those optimizing inference speed for reasoning models like DeepSeek-R1 in production.
Core Mechanics
- Identifies EAGLE-2's 'feature prediction constraint' as the bottleneck limiting gains from more training data — EAGLE-3 removes this constraint and predicts tokens directly
- Training-time test: during training, the draft model's own output is fed back as the next-step input, simulating inference conditions and eliminating the train/test distribution mismatch
- Instead of relying only on the target model's top-layer features, fuses low-, mid-, and high-layer features (concatenation + an FC layer) for richer information
- Exhibits a data scaling law: speedup keeps rising as training data grows up to 8x — the first such scaling curve observed in the EAGLE family
- Llama 3.1 8B Instruct: ~1.4x additional acceleration over EAGLE-2, up to 6.5x total speedup (HumanEval code generation task)
- 38% throughput improvement at batch size 64 in the SGLang framework — earlier EAGLE versions degraded beyond batch size 24
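The second and third bullets can be sketched in a few lines of NumPy — an illustrative toy, not the authors' implementation; all shapes, weights, and the tanh draft step are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, seq = 4096, 8  # target-model width and sequence length (assumed)

# Low-, mid-, and high-layer hidden states from the target model
h_low, h_mid, h_high = (rng.standard_normal((seq, hidden)) for _ in range(3))

# Feature fusion: concatenate the three layers, then one FC projection
W_fc = rng.standard_normal((3 * hidden, hidden)) * 0.02
fused = np.concatenate([h_low, h_mid, h_high], axis=-1) @ W_fc

# "Training-time test": the draft consumes its own previous output rather
# than ground-truth features, mirroring what happens at inference time
W_draft = rng.standard_normal((hidden, hidden)) * 0.02
state = fused[-1]                  # start from the last fused position
for _ in range(3):                 # three self-fed draft steps
    state = np.tanh(state @ W_draft)

print(fused.shape, state.shape)    # (8, 4096) (4096,)
```

The key point is the loop: because the draft model trains on its own (imperfect) outputs, it never sees a distribution at inference that it was not exposed to during training.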
Evidence
- EAGLE-3: Vicuna 13B reaches 5.58x on MT-bench and 6.47x on HumanEval — a 1.31x and 1.30x improvement over EAGLE-2, respectively
- SGLang + H100, batch size 1: SGLang baseline 158 tokens/s vs EAGLE-2 244 tokens/s vs EAGLE-3 373 tokens/s
- Llama 3.1 8B Instruct ablation: removing the feature prediction constraint alone lifts speedup from 3.16x to 3.82x; adding multi-layer feature fusion reaches 4.40x
- 8x data scaling: EAGLE-3's acceptance length grows to 6.0+, while EAGLE-2's stays near 4.0 and barely changes
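As a quick sanity check, the tokens-per-second figures quoted above translate into the following speedup ratios (simple arithmetic on the quoted numbers):

```python
# Tokens/s at batch size 1 on H100, from the Evidence section above
base, eagle2, eagle3 = 158, 244, 373

print(f"EAGLE-2 vs base:    {eagle2 / base:.2f}x")    # 1.54x
print(f"EAGLE-3 vs base:    {eagle3 / base:.2f}x")    # 2.36x
print(f"EAGLE-3 vs EAGLE-2: {eagle3 / eagle2:.2f}x")  # 1.53x
```

The ~1.5x per-request gain over EAGLE-2 at batch size 1 is roughly consistent with the paper's ~1.4x average improvement across tasks.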
How to Apply
- If you use vLLM or SGLang, attach an EAGLE-3 draft model for a speedup with no further code changes — published weights and code are available on GitHub.
- For latency-sensitive serving of reasoning models like DeepSeek-R1: mix math-specialized data into EAGLE-3 draft-model training for domain-specific acceleration.
- For batch serving environments (batch sizes 16-64) that need higher throughput: consider EAGLE-3 — earlier speculative decoding loses its advantage at large batch sizes, while EAGLE-3 stays effective up to batch size 64.
Code Example
# Example of using EAGLE-3 with SGLang
# 1. Clone the repo
# git clone https://github.com/SafeAILab/EAGLE
# 2. Specify the EAGLE-3 draft model when launching the SGLang server
#    (flag names below follow recent SGLang releases; check your version's docs)
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path [EAGLE-3 draft model path] \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 16

# 3. Using with vLLM (argument names vary across vLLM versions;
#    newer releases configure speculative decoding via speculative_config)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_model="[EAGLE-3 draft model path]",
    num_speculative_tokens=3,
    use_v2_block_manager=True,
)
sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(["Tell me about speculative decoding"], sampling_params)
Original Abstract
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.