Slow-Fast Inference: Training-Free Inference Acceleration Using Within-Sentence Attention Stability

Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Mar 12, 2026 • Xingyu Xie, Zhaochen Yu, Yue Liao +3

TL;DR Highlight

A training-free technique that raises LLM decoding throughput by up to 14.4×, built on the observation that attention barely changes within a sentence.

Who Should Read

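ML engineers who operate or optimize LLM serving infrastructure for long-context workloads (RAG, agents, long-CoT). Teams that want to cut inference cost while using existing checkpoints as-is, without retraining the model.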

Core Mechanics

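The authors identify a 'within-sentence support stability' pattern: while a single sentence is being generated, the set of tokens that attention focuses on (the support) barely changes.
Exploiting this pattern, most decoding steps run cheap sparse attention (Fast Step), and full attention (Slow Step) runs only near sentence boundaries.
The dense attention information gathered at a Slow Step is processed by a KL-divergence-based closed-form Selector, which refreshes the sparse KV cache that the following Fast Steps reuse.
Soft-NMS (which discourages picking redundant tokens from the same region) and cross-head exclusivity (which discourages multiple heads from picking the same token) keep the selection diverse.
An asynchronous CUDA pipeline and a sparse attention kernel with a memory-contiguous (coalesced) layout turn the algorithmic savings into actual speedups.
SFI applies directly, without retraining, to all Qwen3 models from 0.6B to 235B, and runs on top of vLLM.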
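
The slow/fast split described above can be sketched as a small scheduler. This is a minimal illustration, not the authors' implementation: `plan_steps` is a hypothetical helper, it treats each token as a plain string, and it assumes that the prefill has already populated the sparse memory (so decoding starts with a fast step) and that a boundary token schedules the slow step for the token that follows it. The trigger set and the 64-step cap are the defaults quoted in this summary.

```python
from typing import Iterable, List, Set

# Defaults quoted in this summary; the names themselves are illustrative.
TRIGGER_TOKENS: Set[str] = {".", "?", "!", ";", "\n"}
MAX_REFRESH_INTERVAL = 64


def plan_steps(tokens: Iterable[str],
               trigger_tokens: Set[str] = TRIGGER_TOKENS,
               max_refresh_interval: int = MAX_REFRESH_INTERVAL) -> List[str]:
    """Label the step that produces each token as 'slow' or 'fast'.

    'slow' = dense attention plus a refresh of the sparse memory,
    'fast' = sparse attention over the current memory.
    """
    plan: List[str] = []
    fast_run = 0         # consecutive fast steps since the last slow step
    refresh_due = False  # assumption: prefill already built the sparse memory
    for tok in tokens:
        if refresh_due or fast_run >= max_refresh_interval:
            plan.append("slow")
            fast_run = 0
        else:
            plan.append("fast")
            fast_run += 1
        # A boundary token schedules a slow step for the next token.
        refresh_due = tok in trigger_tokens
    return plan


if __name__ == "__main__":
    print(plan_steps(["The", " cache", " is", " stable", ".", " Next", " one"]))
```

The cap on consecutive fast steps matters for text with few boundary tokens (code, long enumerations), where a trigger-only policy could leave the sparse memory stale for a long time.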

Evidence

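On Qwen3-4B, the speedup grows from 1.91× to 14.36× as the context grows from 8K to 128K. Full-KV baseline throughput collapses from 759 to 65 tok/s, while SFI degrades far less (1400→935 tok/s).
On Qwen3-235B-A22B (8×B200), SFI also reaches a 13.49× speedup at 128K context.
On LongBench-V2, SFI keeps only 15–20% of tokens, far fewer than other KV compression methods (50% compression), yet ranks first overall with an average score of 34.80, which also exceeds full-KV (34.20).
On the Qwen3-235B-A22B Thinking model, quality stays at the full-KV level: GPQA 80.80 vs 80.80, MMLU 90.09 vs 90.30.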

How to Apply

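In a vLLM-based serving environment, SFI can be layered onto existing Qwen3 checkpoints as-is; the benefit is largest when long inputs, such as 128K contexts, are common.
Set the trigger token set (Ttrig) to sentence/paragraph boundary tokens such as {'.', '?', '!', ';', '\n'} and specify a maximum refresh budget (Tmax) of 64 steps; it then works without any additional training.
It also applies when the generation itself is long, as in long CoT reasoning (Thinking mode) or multi-agent systems; start from the default K=2048 (selected budget per KV head) and tune the quality/speed trade-off from there.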
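
To get a feel for what K=2048 means at 128K context, the sketch below assembles the positions a fast step would attend to for one KV head. It assumes the sparse state is simply the union of sink tokens, a recent window, and the selected positions, using the default sizes listed in this summary (4, 256, 2048); `fast_step_indices` is a hypothetical helper, and whether overlapping positions count against the budget in the real system is not stated here.

```python
from typing import Iterable, List


def fast_step_indices(context_len: int,
                      selected: Iterable[int],
                      sink_tokens: int = 4,
                      recent_window: int = 256) -> List[int]:
    """Sorted KV positions visible to a fast step for one KV head (assumed union)."""
    sinks = range(min(sink_tokens, context_len))
    recent = range(max(0, context_len - recent_window), context_len)
    picked = {i for i in selected if 0 <= i < context_len}
    return sorted(set(sinks) | set(recent) | picked)


if __name__ == "__main__":
    context_len = 128 * 1024
    # Placeholder selection: 2048 positions from the middle of the context.
    selected = range(50_000, 50_000 + 2048)
    visible = fast_step_indices(context_len, selected)
    print(len(visible), f"{len(visible) / context_len:.2%}")
```

Under these assumptions a fast step at 128K reads 2,308 positions per KV head, under 2% of the history, which is why lowering or raising K is the main quality/speed knob.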

Code Example


# Example SFI default settings (based on the paper's default config)
sfi_config = {
    # Managed sparse state composition
    'sink_tokens': 4,          # number of tokens always kept as global anchors
    'recent_window': 256,      # the most recent N tokens are always accessible
    'selected_budget_K': 2048, # number of selected tokens per KV head

    # Slow step trigger settings
    'trigger_tokens': ['.', '?', '!', ';', '\n'],  # sentence boundary tokens
    'max_refresh_interval': 64,  # maximum number of consecutive fast steps

    # Observation window for prefill vs. decode
    'decode_window_W': 1,
    'prefill_window_W': 16,

    # Selector hyperparameters
    'lambda_clip': 0.02,    # upper bound on the prior's influence
    'alpha_soft': 0.5,      # Soft-NMS strength
    'alpha_cross': 0.35,    # cross-head exclusivity strength
}

# Core logic of the Selector's closed-form fusion (Eq. 15, 18)
import torch

def selector_fuse(f, r, lambda_clip=0.02):
    """
    f: evidence distribution (obtained from slow-step attention), a torch.Tensor
    r: cache-aware prior distribution (based on key norm + position), same shape as f
    Returns: fused score s = (1 - lambda*) * f + lambda* * r
    """
    # Compute lambda* (Eq. 18): stabilizes the result so the distribution
    # does not become too peaked
    f_norm_sq = (f ** 2).sum()
    r_norm_sq = (r ** 2).sum()
    f_dot_r = (f * r).sum()

    numerator = f_norm_sq - f_dot_r
    denominator = f_norm_sq - 2 * f_dot_r + r_norm_sq + 1e-8  # epsilon avoids 0/0 when f == r
    lambda_star = (numerator / denominator).clamp(0, lambda_clip)

    # Arithmetic mixture (reverse-KL solution)
    s = (1 - lambda_star) * f + lambda_star * r
    return s

# Usage example
# f = softmax(slow_step_logits)  # attention distribution obtained at the slow step
# r = compute_prior(key_norms, positions)  # key norm + position prior
# fused_score = selector_fuse(f, r)
# selected_indices = fused_score.topk(K).indices
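
The usage example above ends with a plain top-K, which can spend several slots on adjacent positions. A minimal sketch of a Soft-NMS-style alternative follows. It is illustrative only: the neighborhood (`radius`) and the multiplicative decay `(1 - alpha_soft)` are assumptions rather than the paper's exact rule, `soft_nms_select` is a hypothetical name, and plain Python lists are used instead of tensors so the block runs without dependencies.

```python
from typing import List, Sequence


def soft_nms_select(scores: Sequence[float], k: int,
                    radius: int = 2, alpha_soft: float = 0.5) -> List[int]:
    """Greedily pick k positions, down-weighting neighbors of each pick.

    After a position is chosen, every other position within `radius` of it
    has its working score multiplied by (1 - alpha_soft), so nearby
    candidates stay eligible but become less likely to be picked.
    """
    work = list(scores)
    chosen: List[int] = []
    taken = set()
    for _ in range(min(k, len(work))):
        best = max((i for i in range(len(work)) if i not in taken),
                   key=lambda i: work[i])
        chosen.append(best)
        taken.add(best)
        lo, hi = max(0, best - radius), min(len(work), best + radius + 1)
        for j in range(lo, hi):
            if j != best:
                work[j] *= (1.0 - alpha_soft)
    return chosen


if __name__ == "__main__":
    fused = [0.05, 0.30, 0.28, 0.02, 0.01, 0.20, 0.14]
    plain_top2 = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:2]
    print("plain top-2:", plain_top2)
    print("soft-NMS top-2:", soft_nms_select(fused, k=2, radius=1))
```

With `alpha_soft = 0` this reduces to ordinary top-K, and with `alpha_soft = 1` it becomes hard suppression of the whole neighborhood.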

Terminology

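KV Cache: Memory where the LLM stores previously computed keys and values. It avoids recomputing them at every step, but its memory footprint and the per-step attention cost grow with context length.

Attention Support: The set of tokens the model actually 'attends to' in an attention operation, i.e. the small number of tokens in the full history that receive high attention weight.

Sparse Attention: Attention that looks only at a selected subset of tokens instead of the full KV cache. It reduces compute, but quality drops if the wrong tokens are selected.

KL Divergence: A measure of how different two probability distributions are. SFI uses it to blend the currently observed attention distribution with a structural prior in appropriate proportion.

Autoregressive Decoding: The way an LLM generates tokens one at a time, in order. Every step must reference all previous tokens, so decoding slows down as the context grows.

Soft-NMS: A soft version of Non-Maximum Suppression. Among similar candidates in the same neighborhood, it keeps the top one and penalizes the scores of the rest, giving a wider variety of positions a chance to be selected.

MoE (Mixture of Experts): An architecture that activates only some expert networks rather than the whole model. Qwen3-30B-A3B and Qwen3-235B-A22B use this design, which is said to combine well with sparse attention.

Triton: A compiler that lets you write GPU kernels in a Python-like style. It is widely used in high-performance LLM infrastructure such as FlashAttention and vLLM.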
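
To make the 'Attention Support' entry concrete, the sketch below measures how much the support changes between two decoding steps as the Jaccard overlap of their top-k token sets. This is one simple way to quantify stability and is an assumption for illustration; the paper's own metric is not given in this summary, and `top_k_support` and `support_overlap` are hypothetical names.

```python
from typing import Sequence, Set


def top_k_support(weights: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest attention weights (ties keep the earlier index)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:k])


def support_overlap(weights_a: Sequence[float],
                    weights_b: Sequence[float], k: int) -> float:
    """Jaccard overlap of two top-k supports: 1.0 = identical, 0.0 = disjoint."""
    a, b = top_k_support(weights_a, k), top_k_support(weights_b, k)
    union = a | b
    if not union:
        return 1.0  # both supports empty (k == 0 or no tokens)
    return len(a & b) / len(union)


if __name__ == "__main__":
    step_t = [0.40, 0.30, 0.10, 0.10, 0.10]     # mid-sentence step
    step_t1 = [0.35, 0.35, 0.10, 0.15, 0.05]    # next step, same sentence
    step_new = [0.05, 0.10, 0.40, 0.40, 0.05]   # after a sentence boundary
    print(support_overlap(step_t, step_t1, k=2))
    print(support_overlap(step_t, step_new, k=2))
```

An overlap that stays near 1.0 inside sentences and drops around boundary tokens is the kind of signal that would justify refreshing the sparse memory only at slow steps.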

Related Resources

SFI GitHub Repository

Original Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.