SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

May 16, 2023 • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, and 12 others

TL;DR Highlight

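A system in which small auxiliary models predict candidate tokens ahead of time, the candidates are organized into a tree, and the large LLM verifies all of them in parallel in a single pass, speeding up inference by up to 3.5×.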

Who Should Read

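ML infrastructure engineers and MLOps practitioners who need to cut LLM serving latency, especially teams that run inference servers such as vLLM or TGI and are hitting throughput bottlenecks.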

Core Mechanics

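Conventional LLM inference is slow because it generates one token at a time (autoregressive decoding). SpecInfer has small speculative models (SSMs) predict several candidate tokens, organizes them into a tree, and lets the large LLM verify them all in parallel in one pass.
A token tree is built in one of two ways: expanding a single SSM's top-k tokens (expansion), or merging the outputs of several SSMs (merge). With stochastic decoding, going from tree width 1 to width 5 raises the verification success rate from 52-57% to 96-97%.
Tree Attention with a Topology-aware Causal Mask processes every token in the tree in a single GPU kernel, which greatly reduces kernel-launch overhead compared with running a separate kernel per candidate sequence.
Multi-step Speculative Sampling (MSS) provably preserves the LLM's original output distribution under stochastic decoding, while verifying on average 1.27-1.28× more tokens per step than naive sampling.
For distributed (multi-GPU) LLM inference, SpecInfer is 1.5-2.8× faster than vLLM, HuggingFace TGI, and FasterTransformer; for offloading-based inference it is 2.6-3.5× faster than FlexGen. Output quality is unchanged.
The gain is largest at small batch sizes (BS=1-2) and shrinks as the batch grows, because larger batches leave fewer idle GPU resources for speculation. The technique is best suited to real-time, low-latency serving.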
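
The multi-step speculative sampling idea above can be sketched for a single tree node as follows. This is a minimal illustration, not the SpecInfer implementation: it assumes a toy vocabulary of integer token ids, plain Python lists as probability distributions, and candidates that were each drawn independently from an SSM distribution. The names `mss_step`, `normalize`, and `sample` are hypothetical.

```python
import random

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def sample(dist, rng):
    """Draw one token id from a probability list."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def mss_step(p_llm, candidates, rng):
    """One verification step at a single tree node (illustrative only).

    p_llm:      LLM next-token distribution at this node.
    candidates: list of (token, q) pairs, where `token` was sampled from the
                SSM distribution `q` (assumption: samples are independent).
    Returns (token, accepted_from_ssm).
    """
    p = list(p_llm)
    pending = list(candidates)
    while pending:
        # Pick one remaining candidate uniformly at random.
        token, q = pending.pop(rng.randrange(len(pending)))
        # Accept with probability min(1, p(token) / q(token)).
        if rng.random() <= min(1.0, p[token] / q[token]):
            return token, True
        # Rejected: continue against the residual distribution max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        if sum(residual) > 0.0:
            p = normalize(residual)
    # Every candidate was rejected: sample from what is left of the LLM distribution.
    return sample(p, rng), False

if __name__ == "__main__":
    rng = random.Random(0)
    p_llm = [0.5, 0.3, 0.2]   # toy LLM distribution
    q_ssm = [0.2, 0.5, 0.3]   # toy SSM distribution
    candidates = [(sample(q_ssm, rng), q_ssm) for _ in range(2)]
    print(mss_step(p_llm, candidates, rng))
```

Because each rejection replaces the target with the normalized residual, the token returned by `mss_step` is distributed according to `p_llm` regardless of how good the SSM is; a better SSM only raises how often a candidate is accepted.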

Evidence

For LLaMA-65B with multi-node distributed inference (8× A10 GPUs), per-token latency is 2.4-2.8× lower than FasterTransformer (Figure 7).
For OPT-30B offloading-based inference, per-token latency is 3.5× lower than FlexGen at BS=1 and 2.7× lower at BS=16 (Figure 8).
Under stochastic decoding, the verification success rate rises from 52-57% at tree width 1 (sequence-based) to 96-97% at width 5 (Table 1).
MSS verifies on average 1.27-1.28× more tokens per step than naive sampling; on Alpaca, 1.87 → 2.38 tokens per step (Table 3).
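
As a quick sanity check on the Table 3 figures, the short calculation below recomputes the improvement ratio and turns tokens per step into a rough count of LLM decoding steps. The 128-token output length is a hypothetical value chosen for illustration, and the calculation treats the reported averages as tokens emitted per LLM step.

```python
import math

naive_tokens_per_step = 1.87  # Alpaca, naive sampling (Table 3)
mss_tokens_per_step = 2.38    # Alpaca, multi-step speculative sampling (Table 3)

# Ratio of verified tokens per step, MSS over naive sampling.
improvement = mss_tokens_per_step / naive_tokens_per_step
print(f"improvement: {improvement:.2f}x")

def llm_steps(num_tokens, tokens_per_step):
    """Approximate number of LLM decoding steps needed for num_tokens."""
    return math.ceil(num_tokens / tokens_per_step)

output_len = 128  # hypothetical response length
print("naive steps:", llm_steps(output_len, naive_tokens_per_step))
print("MSS steps:  ", llm_steps(output_len, mss_tokens_per_step))
```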

How to Apply

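If you serve LLaMA or OPT family models with vLLM or TGI, attach a lightweight model from the same family as the SSM (for example, LLaMA-68M when serving LLaMA-7B) and switch to SpecInfer; latency gains are most likely at batch sizes 1-4.
If GPU memory is tight and you rely on CPU offloading as in FlexGen, SpecInfer's offloading mode is especially effective: it reduces the number of CPU↔GPU transfers, so a speedup of roughly 3× or more can be expected for OPT-13B/30B.
The open-source implementation (built on FlexFlow) can import HuggingFace models directly. Clone https://github.com/flexflow/FlexFlow and tune the expansion config (for example, <1,1,3,1,1,1,1,1>) to find the tree width that suits your model and dataset.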
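
To get a feel for how an expansion config shapes the token tree, the sketch below counts speculated nodes per step. It assumes the i-th entry is the number of tokens expanded from every node at step i, so the number of nodes at each level is the running product of the entries; `tree_shape` is a hypothetical helper, not part of FlexFlow.

```python
from itertools import accumulate
from operator import mul

def tree_shape(expansion):
    """Return (nodes per level, total speculated nodes, candidate sequences).

    Assumption: expansion[i] is the branching factor applied to every node
    at speculation step i.
    """
    nodes_per_level = list(accumulate(expansion, mul))
    return nodes_per_level, sum(nodes_per_level), nodes_per_level[-1]

if __name__ == "__main__":
    levels, total, leaves = tree_shape([1, 1, 3, 1, 1, 1, 1, 1])
    print("nodes per level:", levels)
    print("total speculated tokens:", total)
    print("candidate sequences:", leaves)
```

The total node count is the number of tokens the LLM has to score in one verification pass, so wider or earlier branching trades extra verification compute for a higher chance that some path is accepted.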

Code Example

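# Basic SpecInfer usage sketch (FlexFlow-based)
# Artifact repo: git clone --recursive https://github.com/goliaro/specinfer-ae.git

# Download the models
# ./download_models.sh

# Run the server-GPU experiments
# ./server_gpu_experiments.sh

# Python API example (flexflow.serve). Written against the API shown in the
# FlexFlow README; argument names and defaults may differ between releases.
import flexflow.serve as ff

# Initialize the runtime before creating any model. The resource values are
# placeholders; size them for your own hardware.
ff.init(
    num_gpus=1,
    memory_per_gpu=14000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=1,
    pipeline_parallelism_degree=1,
)

# LLM (main model) plus SSM (small speculative model)
llm = ff.LLM("huggyllama/llama-7b")
ssm = ff.SSM("JackFram/llama-68m")

# Greedy decoding. The expansion config <1,1,3,1,1,1,1,1> (branch to width 3 at
# the third step) is not passed here: a tree_expansion keyword on
# GenerationConfig is unconfirmed, so check the FlexFlow documentation for how
# your version exposes the token tree settings.
generation_config = ff.GenerationConfig(do_sample=False)

# Compile the SSM first, then the LLM with the SSM attached
ssm.compile(generation_config)
llm.compile(generation_config, ssms=[ssm])

# Generate
result = llm.generate("Machine learning is")
print(result.output_text)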

Terminology

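Speculative Decoding: Before the large model produces its answer, a small model first guesses which tokens are likely to come next, and the large model checks those guesses in a single pass. It resembles a teaching assistant pre-grading exam papers so that the professor only has to do the final check.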
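
A minimal sketch of greedy verification for one drafted sequence follows. The target model is stood in for by a hypothetical `target_next` function that returns the next token for a prefix; a real system would score all drafted positions in one batched LLM forward pass instead of calling the model once per position.

```python
def verify_greedy(prefix, draft, target_next):
    """Return the tokens emitted after verifying `draft` against the target.

    Accepts the longest prefix of `draft` that matches the target's greedy
    choices, then appends one token from the target itself: a correction at
    the first mismatch, or a bonus token if the whole draft was accepted.
    """
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's own token replaces the bad guess
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # whole draft matched
    return accepted

def toy_target(tokens):
    """Hypothetical stand-in for the LLM: next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10

if __name__ == "__main__":
    # The first two drafted tokens match; the third is wrong and gets corrected.
    print(verify_greedy([1, 2], [3, 4, 9, 6], toy_target))
```

Every verification step emits at least one token, so the output is never slower in LLM steps than plain autoregressive decoding; the gain comes from steps where several drafted tokens are accepted at once.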

SSM (Small Speculative Model): A lightweight model roughly 100-1000× smaller than the LLM. It is less accurate but proposes candidate tokens quickly. The paper uses models such as LLaMA-68M and OPT-125M as SSMs.

Autoregressive Decoding: The way an LLM generates tokens one at a time, in order, each depending on the tokens before it. It is hard to parallelize and is a major source of latency.

KV Cache (Key-Value Cache): A cache that stores previously computed Key/Value tensors from Transformer attention so they can be reused. It accounts for a large share of LLM inference memory.

Token Tree: Candidate token sequences predicted by one or more SSMs, merged into a tree in which sequences share common prefixes. Each path from the root to a leaf is one candidate sequence.
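
The prefix sharing can be illustrated with a small trie built from nested dictionaries. This is a sketch under simple assumptions (integer token ids, candidate sequences given as lists); `build_token_tree`, `count_nodes`, and `leaf_paths` are hypothetical names.

```python
def build_token_tree(sequences):
    """Merge candidate sequences into a trie of nested dicts keyed by token id."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})  # reuse the branch if the prefix exists
    return root

def count_nodes(tree):
    """Number of token nodes in the tree (the root itself holds no token)."""
    return sum(1 + count_nodes(child) for child in tree.values())

def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a list of token ids."""
    if not tree:
        yield list(prefix)
        return
    for tok, child in tree.items():
        yield from leaf_paths(child, prefix + (tok,))

if __name__ == "__main__":
    candidates = [[5, 7, 9], [5, 7, 2], [5, 3]]
    tree = build_token_tree(candidates)
    print("tokens before merging:", sum(len(s) for s in candidates))
    print("nodes after merging:  ", count_nodes(tree))
    print("candidate paths:      ", list(leaf_paths(tree)))
```

Shared prefixes are stored once, so the LLM scores each shared token a single time. Note that in this sketch a candidate that is a strict prefix of another candidate ends at an internal node and does not appear in `leaf_paths`.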

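Offloading-based Inference: When GPU memory is insufficient, model parameters are kept in CPU RAM or on disk and only the parts needed are loaded onto the GPU for computation. It is an inexpensive way to run large models, but CPU↔GPU transfers become the bottleneck.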

Tree Attention: An extension of the standard sequence attention mechanism to tree structures. The causal mask is derived from the tree topology so that each node attends only to itself and the nodes on its ancestor path.

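Topology-aware Causal Mask: A mask that sets attention between tokens on different branches to -∞ so that every token in the tree can be processed in a single GPU kernel. This lets multiple candidate sequences be handled as if they were one batch.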
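
A minimal sketch of how such a mask could be derived from the tree topology: each node is given as an index into a parent array (hypothetical representation, with -1 marking the root), and a node may attend only to itself and its ancestors. The sketch covers only the speculated tokens; in practice every tree token would also attend to the already verified prefix held in the KV cache.

```python
def tree_causal_mask(parents):
    """Build a boolean attention mask from a parent array.

    parents[i] is the index of node i's parent, or -1 for the root.
    mask[i][j] is True when node i may attend to node j, that is, when j is
    i itself or one of i's ancestors. False entries correspond to positions
    that would be set to -inf before the softmax.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:       # walk from the node up to the root
            mask[i][j] = True
            j = parents[j]
    return mask

if __name__ == "__main__":
    # Toy tree: node 0 is the root, nodes 1 and 2 are its children,
    # nodes 3 and 4 are children of node 1.
    parents = [-1, 0, 0, 1, 1]
    for row in tree_causal_mask(parents):
        print("".join("1" if allowed else "." for allowed in row))
```

For a tree that is a single chain, the mask reduces to the usual lower-triangular causal mask, which is why ordinary sequence decoding is a special case.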

Original Abstract

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8× for distributed LLM inference and by 2.6-3.5× for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/