DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
TL;DR Highlight
A serving architecture that separates LLM inference's Prefill and Decoding stages onto different GPUs, handling up to 7.4x more requests than vLLM.
Who Should Read
MLOps/infrastructure engineers wanting to simultaneously improve LLM service response latency (TTFT/TPOT) and GPU cost efficiency. Especially useful for teams over-provisioning GPUs to meet SLOs while running vLLM or DeepSpeed in production.
Core Mechanics
- Existing systems like vLLM batch-process Prefill (prompt processing) and Decoding (token generation) on the same GPU, causing interference that degrades both TTFT and TPOT
- DistServe completely separates Prefill and Decoding to different GPUs, blocking interference at the source, and independently applies the best parallelism strategy (intra-op vs inter-op) for each stage
- Prefill is compute-bound, so intra-op parallelism (tensor parallelism) suits it better; Decoding is memory-bandwidth-bound, so inter-op parallelism (pipeline parallelism) scales its throughput nearly linearly
- KV Cache transfer overhead from Prefill GPU to Decoding GPU may seem concerning but is actually <0.1% of total latency — negligible with NVLINK
- Auto placement algorithm considers TTFT/TPOT SLO conditions and cluster bandwidth to automatically optimize prefill:decoding GPU ratio and parallelism (search time max 1.3 min)
- Periodic re-planning triggered when workload patterns change to re-optimize placement strategy for new environment
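The compute-bound vs memory-bandwidth-bound distinction above can be checked with a back-of-envelope arithmetic-intensity calculation. A rough sketch: the GPU peak numbers below are assumed A100-class figures, not from the paper, and the model is a single fp16 d x d linear layer.

```python
# Roofline check: a d x d linear layer processing n tokens does ~2*n*d^2 FLOPs
# while reading ~2*d^2 bytes of fp16 weights, so arithmetic intensity ~= n
# FLOPs/byte. Compare against the GPU's ridge point (peak FLOPs / peak BW).

PEAK_FLOPS = 312e12   # assumed: A100-class fp16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # assumed: A100-class HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # ~156 FLOPs/byte

def intensity(num_tokens: int, d: int = 12288) -> float:
    flops = 2 * num_tokens * d * d   # GEMM FLOPs for one d x d layer
    bytes_moved = 2 * d * d          # fp16 weights read once
    return flops / bytes_moved       # simplifies to num_tokens

prefill = intensity(755)   # whole prompt in one pass (ShareGPT avg input)
decode = intensity(32)     # one token per request, batch of 32 requests

print(prefill > ridge)  # True: prefill sits above the ridge, compute-bound
print(decode < ridge)   # True: decoding sits below it, memory-bandwidth-bound
```

This is why one shared parallelism plan cannot be optimal for both stages: the two workloads sit on opposite sides of the roofline.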
Evidence
- Up to 7.4x more requests processed or 12.6x tighter SLO achieved vs vLLM (90% SLO attainment basis)
- OPT-175B ShareGPT workload: KV Cache transfer overhead <0.1% of total latency, 95% of requests experience <30ms transfer latency
- Summarization task (OPT-66B): 4.3x higher request throughput and 12.6x tighter SLO support vs vLLM
- Simulator accuracy: <2% SLO attainment error vs real system — validating placement algorithm reliability
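The <30ms transfer figure is plausible from first principles. A rough estimate follows, using the public OPT-175B dimensions; the NVLINK per-direction bandwidth is an assumed round number, not taken from the paper:

```python
# Back-of-envelope KV cache transfer time for one OPT-175B request over NVLINK.
LAYERS, HIDDEN = 96, 12288   # public OPT-175B configuration
BYTES_FP16 = 2

# One K vector and one V vector per token per layer, in fp16:
kv_bytes_per_token = 2 * HIDDEN * BYTES_FP16 * LAYERS   # ~4.5 MiB per token

prompt_tokens = 755          # ShareGPT average input length
total_bytes = kv_bytes_per_token * prompt_tokens        # ~3.3 GiB

NVLINK_BW = 300e9            # assumed ~300 GB/s per direction
transfer_ms = total_bytes / NVLINK_BW * 1e3
print(round(transfer_ms, 1))  # ~12 ms, consistent with the <30 ms figure
```

Even at this 175B scale the transfer is a one-time cost of tens of milliseconds per request, versus seconds of decoding, which is why the overhead lands under 0.1% of total latency.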
How to Apply
- If vLLM isn't meeting TPOT SLOs: deploy separate Prefill-only and Decoding-only instances, applying tensor parallelism to the Prefill instances and pipeline parallelism to the Decoding instances
- When serving apps with different SLO requirements simultaneously (chatbot needs low TTFT, document summarization needs low TPOT): set TTFT/TPOT targets per app and use DistServe's placement algorithm to auto-calculate GPU allocation ratios — achieve SLOs without over-provisioning
- In clusters with intra-node NVLINK but limited cross-node bandwidth (e.g., 25Gbps): use Low Node-Affinity algorithm to place Prefill/Decoding instances on the same node and transfer KV Cache via NVLINK for minimal overhead
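The placement step described above is a simulator-driven search over parallelism plans. The sketch below illustrates the shape of that search only; the function names and the cost model are hypothetical stand-ins, not DistServe's actual API (the real system replays profiled latencies through a simulator):

```python
import itertools

# Hypothetical toy cost model. All constants are made up for illustration;
# DistServe estimates per-config goodput with a profiled simulator instead.
def slo_constrained_rate(stage, tp, pp, ttft_slo, tpot_slo):
    if stage == "prefill":
        latency = 1.2 / tp              # intra-op parallelism cuts TTFT
        rate = tp * 0.9 ** (tp - 1)     # sub-linear throughput scaling
        return rate if latency <= ttft_slo else 0.0
    latency = 0.08 / min(tp, 2)         # decoding latency barely needs tp
    rate = float(pp)                    # inter-op scales throughput ~linearly
    return rate if latency <= tpot_slo else 0.0

def best_plan(stage, num_gpus=4, ttft_slo=0.4, tpot_slo=0.1):
    # Enumerate (tensor, pipeline) plans using exactly num_gpus GPUs and keep
    # the highest-rate plan that still meets the stage's latency SLO.
    plans = [(tp, pp) for tp, pp in itertools.product([1, 2, 4], repeat=2)
             if tp * pp == num_gpus]
    return max(plans, key=lambda c: slo_constrained_rate(stage, *c,
                                                         ttft_slo, tpot_slo))

print(best_plan("prefill"))   # (4, 1): tensor-parallel prefill to hit TTFT
print(best_plan("decoding"))  # (1, 4): pipeline-parallel decoding for rate
```

Because the two stages are searched independently, each one lands on the plan the bullets above predict: intra-op for Prefill, inter-op for Decoding.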
Code Example
# DistServe GitHub: https://github.com/LLMServe/DistServe
# DistServe deployment example (conceptual flow)
# 1. Workload characteristic profiling
workload = {
    'avg_input_length': 755,   # based on ShareGPT
    'avg_output_length': 200,
    'arrival_rate': 5.0,       # req/s
    'ttft_slo': 2.5,           # seconds (OPT-66B chatbot)
    'tpot_slo': 0.15           # seconds
}

# 2. Search for optimal GPU placement using the placement algorithm
# DistServe automatically finds a configuration like the one below
# (OPT-66B ShareGPT result example):
optimal_placement = {
    'prefill_instance': {
        'tensor_parallelism': 4,    # intra-op: to reduce TTFT
        'pipeline_parallelism': 1,
        'num_gpus': 4
    },
    'decoding_instance': {
        'tensor_parallelism': 2,
        'pipeline_parallelism': 2,  # inter-op: linear throughput scaling
        'num_gpus': 4
    },
    'prefill_to_decoding_ratio': '1:1'  # 2:1 also possible depending on workload
}
# 3. KV Cache transfer: After prefill completes, decoding instance fetches via 'pull' method
# (using pull instead of push to prevent memory overload)
# Client request via OpenAI API-compatible interface
import openai
client = openai.OpenAI(
    base_url='http://distserve-endpoint:8000/v1',
    api_key='dummy'
)
response = client.chat.completions.create(
    model='opt-66b',
    messages=[{'role': 'user', 'content': 'Summarize this article...'}],
    max_tokens=200
)
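The pull-based handoff in step 3 can be sketched as a staging buffer that the decoding side drains at its own pace. This is a minimal illustration of the idea only; the names and the `Queue`-based transport are hypothetical, not DistServe's actual implementation:

```python
from queue import Queue

# Prefill side stages finished KV caches; the decoding side pulls them only
# when it has free KV-cache slots, so a slow decoder is never pushed into
# running out of memory.
staging: Queue = Queue()

def prefill_finish(request_id: str, kv_cache: bytes) -> None:
    # Publish the finished prefill's KV cache; do NOT push it to the decoder.
    staging.put((request_id, kv_cache))

def decode_pull(free_slots: int) -> list:
    # Decoder fetches at most as many caches as it has memory for.
    pulled = []
    while free_slots > 0 and not staging.empty():
        pulled.append(staging.get())
        free_slots -= 1
    return pulled

prefill_finish("req-1", b"...")
prefill_finish("req-2", b"...")
print(len(decode_pull(free_slots=1)))  # 1: bounded by decoder memory, not sender
```

The design choice matters under bursty load: with push, a backed-up decoder accumulates caches it cannot yet use; with pull, the backlog stays on the prefill side where admission control can react.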
Original Abstract
DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for >90% of requests.