Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
TL;DR Highlight
A training-free technique that speeds up LLM inference by up to 14.4x, built on the observation that the set of tokens attention focuses on barely changes within a sentence.
Who Should Read
ML engineers running or optimizing LLM serving infrastructure for long-context workloads (RAG, agents, long-CoT). Teams that want to cut inference costs while reusing existing checkpoints without retraining.
Core Mechanics
- Discovered a 'within-sentence support stability' pattern — the set of tokens that attention focuses on (the support) barely changes during a single sentence's generation
- Leveraged this pattern to use cheap sparse attention (Fast Step) for most decoding steps, and full attention (Slow Step) only near sentence boundaries
- Dense attention info from Slow Steps is processed by a KL divergence-based closed-form Selector to update the sparse KV cache reused by subsequent Fast Steps
- Soft-NMS (preventing duplicate token selection within the same region) and Cross-head exclusivity (preventing multiple heads from selecting the same tokens) ensure selection diversity
- CUDA async pipelines and coalesced sparse attention kernels convert algorithmic savings into real-world speedups
- Works out of the box on Qwen3 models from 0.6B to 235B without retraining, running on top of vLLM
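The slow/fast scheduling above can be sketched as a toy simulation. Note this is an illustrative stand-in, not the paper's implementation: `schedule_steps` and its decoding-over-characters setup are hypothetical, with only the trigger set and refresh budget taken from the paper's defaults.

```python
# Toy simulation of SFI's slow/fast scheduling (illustrative, not the
# paper's code): label each decoding step 'slow' or 'fast' based on
# sentence-boundary triggers and the maximum refresh interval.
TRIGGER_TOKENS = {'.', '?', '!', ';', '\n'}  # paper's default trigger set
MAX_REFRESH_INTERVAL = 64                    # force a slow step after this many fast steps

def schedule_steps(tokens):
    """Return a 'slow'/'fast' label per decoding step."""
    labels = []
    fast_since_refresh = 0
    prev = None
    for tok in tokens:
        # A slow step fires just after a boundary token, or when the
        # fast-step budget is exhausted.
        if prev in TRIGGER_TOKENS or fast_since_refresh >= MAX_REFRESH_INTERVAL:
            labels.append('slow')   # dense attention + Selector refresh
            fast_since_refresh = 0
        else:
            labels.append('fast')   # sparse attention over the compact memory
            fast_since_refresh += 1
        prev = tok
    return labels

# Decoding this span triggers exactly one slow step, right after the '.'
labels = schedule_steps(list("One sentence here. Another one!"))
```

Because slow steps are rare (once per sentence, or at worst every 64 steps), nearly all decoding runs at the fast steps' sparse-attention cost.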
Evidence
- On Qwen3-4B, speedup scales from 1.91x at 8K context to 14.36x at 128K; while the full-KV baseline drops from 759 to 65 tok/s, SFI declines only from 1400 to 935 tok/s
- Also achieves 13.49x speedup on Qwen3-235B-A22B (8xB200) at 128K context
- On LongBench-V2, reaches first place overall with 34.80 avg score using just 15-20% of tokens — far fewer than other KV compression methods at 50%, even beating full-KV (34.20)
- Qwen3-235B-A22B Thinking model maintains full-KV quality: GPQA 80.80 vs 80.80, MMLU 90.09 vs 90.30
How to Apply
- Drop SFI on top of existing Qwen3 checkpoints in a vLLM serving setup — benefits are most dramatic with long inputs like 128K context
- Set trigger token set (Ttrig) to sentence/paragraph boundary tokens like {'.', '?', '!', ';', newline}, specify a max 64-step refresh budget (Tmax), and it runs without any additional training
- Applicable to long-CoT reasoning (Thinking mode) and multi-agent systems with long generation lengths; start with K=2048 (per KV head selected budget) and tune for quality/speed tradeoff
Code Example
```python
# SFI basic configuration example (based on the paper's default config)
sfi_config = {
    # Managed sparse state configuration
    'sink_tokens': 4,             # number of tokens always retained as global anchors
    'recent_window': 256,         # always attend to the most recent N tokens
    'selected_budget_K': 2048,    # number of selected tokens per KV head
    # Slow-step trigger settings
    'trigger_tokens': ['.', '?', '!', ';', '\n'],  # sentence-boundary tokens
    'max_refresh_interval': 64,   # maximum number of consecutive fast steps
    # Prefill vs. decode observation window
    'decode_window_W': 1,
    'prefill_window_W': 16,
    # Selector hyperparameters
    'lambda_clip': 0.02,          # upper bound on prior influence
    'alpha_soft': 0.5,            # Soft-NMS intensity
    'alpha_cross': 0.35,          # cross-head exclusivity intensity
}
```
```python
# Selector closed-form fusion core logic (Eq. 15, 18)
import torch

def selector_fuse(f, r, lambda_clip=0.02):
    """
    f: evidence distribution (obtained from slow-step attention)
    r: cache-aware prior distribution (based on key norm + position)
    Returns the fused score s = (1 - lambda*) * f + lambda* * r.
    """
    # Compute lambda* (Eq. 18): stabilizes the fusion so the
    # distribution does not become too peaked
    f_norm_sq = (f ** 2).sum()
    r_norm_sq = (r ** 2).sum()
    f_dot_r = (f * r).sum()
    numerator = f_norm_sq - f_dot_r
    denominator = f_norm_sq - 2 * f_dot_r + r_norm_sq + 1e-8
    lambda_star = (numerator / denominator).clamp(0, lambda_clip)
    # Arithmetic mixture (reverse-KL solution)
    return (1 - lambda_star) * f + lambda_star * r

# Usage example
# f = softmax(slow_step_logits)             # attention distribution from a slow step
# r = compute_prior(key_norms, positions)   # key-norm + position prior
# fused_score = selector_fuse(f, r)
# selected_indices = fused_score.topk(K).indices
```
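Setting the clamp aside, the `lambda_star` computed in `selector_fuse` has a simple closed-form reading, inferred here from the code itself rather than from the paper's Eq. 18 derivation: it is the mixture weight that minimizes the squared norm of the fused distribution, which is exactly what keeps the result from becoming too peaked.

```latex
% lambda* as the squared-norm minimizer of the arithmetic mixture
\lambda^{*} = \arg\min_{\lambda}\; \bigl\| (1-\lambda)\,f + \lambda\, r \bigr\|^{2}

% Setting the derivative to zero:
\frac{d}{d\lambda}\,\bigl\| f + \lambda (r - f) \bigr\|^{2}
  = 2\,\bigl\langle f + \lambda (r - f),\; r - f \bigr\rangle = 0
\;\Longrightarrow\;
\lambda^{*} = \frac{\|f\|^{2} - \langle f, r\rangle}
                   {\|f\|^{2} - 2\,\langle f, r\rangle + \|r\|^{2}}
```

The final fraction matches `numerator` and `denominator` in the code term by term; the result is then clamped to [0, lambda_clip] so the prior's influence stays bounded.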
Original Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.