Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
TL;DR Highlight
A training-free technique that speeds up LLM inference by up to 14.4x, built on the observation that the set of tokens attention focuses on barely changes within a sentence.
Who Should Read
ML engineers running or optimizing LLM serving infrastructure for long-context workloads (RAG, agents, long-CoT). Teams that want to cut inference costs while reusing existing checkpoints without retraining.
Core Mechanics
- Discovered a 'within-sentence support stability' pattern — the set of tokens that attention focuses on (the support) barely changes during a single sentence's generation
- Leveraged this pattern to use cheap sparse attention (Fast Step) for most decoding steps, and full attention (Slow Step) only near sentence boundaries
- Dense attention info from Slow Steps is processed by a KL divergence-based closed-form Selector to update the sparse KV cache reused by subsequent Fast Steps
- Soft-NMS (preventing duplicate token selection within the same region) and Cross-head exclusivity (preventing multiple heads from selecting the same tokens) ensure selection diversity
- CUDA async pipelines and coalesced sparse attention kernels convert algorithmic savings into real-world speedups
- Works out of the box on Qwen3 models from 0.6B to 235B without retraining, running on top of vLLM
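The slow/fast scheduling above can be sketched as a toy simulation. Note this is an illustrative stand-in, not the paper's implementation: `schedule_steps` and its decoding-over-characters setup are hypothetical, with only the trigger set and refresh budget taken from the paper's defaults.

```python
# Toy simulation of SFI's slow/fast scheduling (illustrative, not the
# paper's code): label each decoding step 'slow' or 'fast' based on
# sentence-boundary triggers and the maximum refresh interval.
TRIGGER_TOKENS = {'.', '?', '!', ';', '\n'}  # paper's default trigger set
MAX_REFRESH_INTERVAL = 64                    # force a slow step after this many fast steps

def schedule_steps(tokens):
    """Return a 'slow'/'fast' label per decoding step."""
    labels = []
    fast_since_refresh = 0
    prev = None
    for tok in tokens:
        # A slow step fires just after a boundary token, or when the
        # fast-step budget is exhausted.
        if prev in TRIGGER_TOKENS or fast_since_refresh >= MAX_REFRESH_INTERVAL:
            labels.append('slow')   # dense attention + Selector refresh
            fast_since_refresh = 0
        else:
            labels.append('fast')   # sparse attention over the compact memory
            fast_since_refresh += 1
        prev = tok
    return labels

# Decoding this span triggers exactly one slow step, right after the '.'
labels = schedule_steps(list("One sentence here. Another one!"))
```

Because slow steps are rare (once per sentence, or at worst every 64 steps), nearly all decoding runs at the fast steps' sparse-attention cost.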
Evidence
- On Qwen3-4B, speedup scales from 1.91x at 8K context to 14.36x at 128K; while the full-KV baseline drops from 759 to 65 tok/s, SFI declines only from 1400 to 935 tok/s
- Also achieves 13.49x speedup on Qwen3-235B-A22B (8xB200) at 128K context
- On LongBench-V2, reaches first place overall with 34.80 avg score using just 15-20% of tokens — far fewer than other KV compression methods at 50%, even beating full-KV (34.20)
- Qwen3-235B-A22B Thinking model maintains full-KV quality: GPQA 80.80 vs 80.80, MMLU 90.09 vs 90.30
How to Apply
- Drop SFI on top of existing Qwen3 checkpoints in a vLLM serving setup — benefits are most dramatic with long inputs like 128K context
- Set trigger token set (Ttrig) to sentence/paragraph boundary tokens like {'.', '?', '!', ';', newline}, specify a max 64-step refresh budget (Tmax), and it runs without any additional training
- Applicable to long-CoT reasoning (Thinking mode) and multi-agent systems with long generation lengths; start with K=2048 (per KV head selected budget) and tune for quality/speed tradeoff
Code Example
```python
# SFI basic configuration example (based on the paper's default config)
sfi_config = {
    # Managed sparse state configuration
    'sink_tokens': 4,             # number of tokens always retained as global anchors
    'recent_window': 256,         # always attend to the most recent N tokens
    'selected_budget_K': 2048,    # number of selected tokens per KV head
    # Slow-step trigger settings
    'trigger_tokens': ['.', '?', '!', ';', '\n'],  # sentence-boundary tokens
    'max_refresh_interval': 64,   # maximum number of consecutive fast steps
    # Prefill vs. decode observation window
    'decode_window_W': 1,
    'prefill_window_W': 16,
    # Selector hyperparameters
    'lambda_clip': 0.02,          # upper bound on prior influence
    'alpha_soft': 0.5,            # Soft-NMS intensity
    'alpha_cross': 0.35,          # cross-head exclusivity intensity
}
```
```python
# Selector closed-form fusion core logic (Eq. 15, 18)
import torch

def selector_fuse(f, r, lambda_clip=0.02):
    """
    f: evidence distribution (obtained from slow-step attention)
    r: cache-aware prior distribution (based on key norm + position)
    Returns the fused score s = (1 - lambda*) * f + lambda* * r.
    """
    # Compute lambda* (Eq. 18): stabilizes the fusion so the
    # distribution does not become too peaked
    f_norm_sq = (f ** 2).sum()
    r_norm_sq = (r ** 2).sum()
    f_dot_r = (f * r).sum()
    numerator = f_norm_sq - f_dot_r
    denominator = f_norm_sq - 2 * f_dot_r + r_norm_sq + 1e-8
    lambda_star = (numerator / denominator).clamp(0, lambda_clip)
    # Arithmetic mixture (reverse-KL solution)
    return (1 - lambda_star) * f + lambda_star * r

# Usage example
# f = softmax(slow_step_logits)             # attention distribution from a slow step
# r = compute_prior(key_norms, positions)   # key-norm + position prior
# fused_score = selector_fuse(f, r)
# selected_indices = fused_score.topk(K).indices
```
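Setting the clamp aside, the `lambda_star` computed in `selector_fuse` has a simple closed-form reading, inferred here from the code itself rather than from the paper's Eq. 18 derivation: it is the mixture weight that minimizes the squared norm of the fused distribution, which is exactly what keeps the result from becoming too peaked.

```latex
% lambda* as the squared-norm minimizer of the arithmetic mixture
\lambda^{*} = \arg\min_{\lambda}\; \bigl\| (1-\lambda)\,f + \lambda\, r \bigr\|^{2}

% Setting the derivative to zero:
\frac{d}{d\lambda}\,\bigl\| f + \lambda (r - f) \bigr\|^{2}
  = 2\,\bigl\langle f + \lambda (r - f),\; r - f \bigr\rangle = 0
\;\Longrightarrow\;
\lambda^{*} = \frac{\|f\|^{2} - \langle f, r\rangle}
                   {\|f\|^{2} - 2\,\langle f, r\rangle + \|r\|^{2}}
```

The final fraction matches `numerator` and `denominator` in the code term by term; the result is then clamped to [0, lambda_clip] so the prior's influence stays bounded.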
Original Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.