DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
TL;DR Highlight
A serving architecture that separates LLM inference's Prefill and Decoding stages onto different GPUs, handling up to 7.4x more requests than vLLM.
Who Should Read
MLOps/infrastructure engineers wanting to simultaneously improve LLM service response latency (TTFT/TPOT) and GPU cost efficiency. Especially useful for teams over-provisioning GPUs to meet SLOs while running vLLM or DeepSpeed in production.
Core Mechanics
- Existing systems like vLLM batch-process Prefill (prompt processing) and Decoding (token generation) on the same GPU, causing interference that degrades both TTFT and TPOT
- DistServe completely separates Prefill and Decoding to different GPUs, blocking interference at the source, and independently applies the best parallelism strategy (intra-op vs inter-op) for each stage
- Prefill is compute-bound, so intra-op parallelism (tensor parallelism) suits it better; Decoding is memory-bandwidth-bound, so inter-op parallelism (pipeline parallelism) scales its throughput nearly linearly
- KV Cache transfer overhead from Prefill GPU to Decoding GPU may seem concerning but is actually <0.1% of total latency — negligible with NVLINK
- Auto placement algorithm considers TTFT/TPOT SLO conditions and cluster bandwidth to automatically optimize prefill:decoding GPU ratio and parallelism (search time max 1.3 min)
- Periodic re-planning triggered when workload patterns change to re-optimize placement strategy for new environment
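The compute-bound vs memory-bandwidth-bound distinction above can be checked with a back-of-envelope arithmetic-intensity calculation. A rough sketch: the GPU peak numbers below are assumed A100-class figures, not from the paper, and the model is a single fp16 d x d linear layer.

```python
# Roofline check: a d x d linear layer processing n tokens does ~2*n*d^2 FLOPs
# while reading ~2*d^2 bytes of fp16 weights, so arithmetic intensity ~= n
# FLOPs/byte. Compare against the GPU's ridge point (peak FLOPs / peak BW).

PEAK_FLOPS = 312e12   # assumed: A100-class fp16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # assumed: A100-class HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / PEAK_BW  # ~156 FLOPs/byte

def intensity(num_tokens: int, d: int = 12288) -> float:
    flops = 2 * num_tokens * d * d   # GEMM FLOPs for one d x d layer
    bytes_moved = 2 * d * d          # fp16 weights read once
    return flops / bytes_moved       # simplifies to num_tokens

prefill = intensity(755)   # whole prompt in one pass (ShareGPT avg input)
decode = intensity(32)     # one token per request, batch of 32 requests

print(prefill > ridge)  # True: prefill sits above the ridge, compute-bound
print(decode < ridge)   # True: decoding sits below it, memory-bandwidth-bound
```

This is why one shared parallelism plan cannot be optimal for both stages: the two workloads sit on opposite sides of the roofline.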
Evidence
- Up to 7.4x more requests processed or 12.6x tighter SLO achieved vs vLLM (90% SLO attainment basis)
- OPT-175B ShareGPT workload: KV Cache transfer overhead <0.1% of total latency, 95% of requests experience <30ms transfer latency
- Summarization task (OPT-66B): 4.3x higher request throughput and 12.6x tighter SLO support vs vLLM
- Simulator accuracy: <2% SLO attainment error vs real system — validating placement algorithm reliability
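The <30ms transfer figure is plausible from first principles. A rough estimate follows, using the public OPT-175B dimensions; the NVLINK per-direction bandwidth is an assumed round number, not taken from the paper:

```python
# Back-of-envelope KV cache transfer time for one OPT-175B request over NVLINK.
LAYERS, HIDDEN = 96, 12288   # public OPT-175B configuration
BYTES_FP16 = 2

# One K vector and one V vector per token per layer, in fp16:
kv_bytes_per_token = 2 * HIDDEN * BYTES_FP16 * LAYERS   # ~4.5 MiB per token

prompt_tokens = 755          # ShareGPT average input length
total_bytes = kv_bytes_per_token * prompt_tokens        # ~3.3 GiB

NVLINK_BW = 300e9            # assumed ~300 GB/s per direction
transfer_ms = total_bytes / NVLINK_BW * 1e3
print(round(transfer_ms, 1))  # ~12 ms, consistent with the <30 ms figure
```

Even at this 175B scale the transfer is a one-time cost of tens of milliseconds per request, versus seconds of decoding, which is why the overhead lands under 0.1% of total latency.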
How to Apply
- If vLLM isn't meeting TPOT SLOs: deploy separate Prefill-only and Decoding-only instances, applying tensor parallelism to the Prefill instances and pipeline parallelism to the Decoding instances
- When serving apps with different SLO requirements simultaneously (chatbot needs low TTFT, document summarization needs low TPOT): set TTFT/TPOT targets per app and use DistServe's placement algorithm to auto-calculate GPU allocation ratios — achieve SLOs without over-provisioning
- In clusters with intra-node NVLINK but limited cross-node bandwidth (e.g., 25Gbps): use Low Node-Affinity algorithm to place Prefill/Decoding instances on the same node and transfer KV Cache via NVLINK for minimal overhead
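The placement step described above is a simulator-driven search over parallelism plans. The sketch below illustrates the shape of that search only; the function names and the cost model are hypothetical stand-ins, not DistServe's actual API (the real system replays profiled latencies through a simulator):

```python
import itertools

# Hypothetical toy cost model. All constants are made up for illustration;
# DistServe estimates per-config goodput with a profiled simulator instead.
def slo_constrained_rate(stage, tp, pp, ttft_slo, tpot_slo):
    if stage == "prefill":
        latency = 1.2 / tp              # intra-op parallelism cuts TTFT
        rate = tp * 0.9 ** (tp - 1)     # sub-linear throughput scaling
        return rate if latency <= ttft_slo else 0.0
    latency = 0.08 / min(tp, 2)         # decoding latency barely needs tp
    rate = float(pp)                    # inter-op scales throughput ~linearly
    return rate if latency <= tpot_slo else 0.0

def best_plan(stage, num_gpus=4, ttft_slo=0.4, tpot_slo=0.1):
    # Enumerate (tensor, pipeline) plans using exactly num_gpus GPUs and keep
    # the highest-rate plan that still meets the stage's latency SLO.
    plans = [(tp, pp) for tp, pp in itertools.product([1, 2, 4], repeat=2)
             if tp * pp == num_gpus]
    return max(plans, key=lambda c: slo_constrained_rate(stage, *c,
                                                         ttft_slo, tpot_slo))

print(best_plan("prefill"))   # (4, 1): tensor-parallel prefill to hit TTFT
print(best_plan("decoding"))  # (1, 4): pipeline-parallel decoding for rate
```

Because the two stages are searched independently, each one lands on the plan the bullets above predict: intra-op for Prefill, inter-op for Decoding.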
Code Example
# DistServe GitHub: https://github.com/LLMServe/DistServe
# DistServe deployment example (conceptual flow)
# 1. Workload characteristic profiling
workload = {
    'avg_input_length': 755,   # based on ShareGPT
    'avg_output_length': 200,
    'arrival_rate': 5.0,       # req/s
    'ttft_slo': 2.5,           # seconds (OPT-66B chatbot)
    'tpot_slo': 0.15           # seconds
}

# 2. Search for optimal GPU placement using the placement algorithm
# DistServe automatically finds a configuration like the one below
# (OPT-66B ShareGPT result example):
optimal_placement = {
    'prefill_instance': {
        'tensor_parallelism': 4,    # intra-op: to reduce TTFT
        'pipeline_parallelism': 1,
        'num_gpus': 4
    },
    'decoding_instance': {
        'tensor_parallelism': 2,
        'pipeline_parallelism': 2,  # inter-op: linear throughput scaling
        'num_gpus': 4
    },
    'prefill_to_decoding_ratio': '1:1'  # 2:1 also possible depending on workload
}
# 3. KV Cache transfer: After prefill completes, decoding instance fetches via 'pull' method
# (using pull instead of push to prevent memory overload)
# Client request via OpenAI API-compatible interface
import openai
client = openai.OpenAI(
    base_url='http://distserve-endpoint:8000/v1',
    api_key='dummy'
)
response = client.chat.completions.create(
    model='opt-66b',
    messages=[{'role': 'user', 'content': 'Summarize this article...'}],
    max_tokens=200
)
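The pull-based handoff in step 3 can be sketched as a staging buffer that the decoding side drains at its own pace. This is a minimal illustration of the idea only; the names and the `Queue`-based transport are hypothetical, not DistServe's actual implementation:

```python
from queue import Queue

# Prefill side stages finished KV caches; the decoding side pulls them only
# when it has free KV-cache slots, so a slow decoder is never pushed into
# running out of memory.
staging: Queue = Queue()

def prefill_finish(request_id: str, kv_cache: bytes) -> None:
    # Publish the finished prefill's KV cache; do NOT push it to the decoder.
    staging.put((request_id, kv_cache))

def decode_pull(free_slots: int) -> list:
    # Decoder fetches at most as many caches as it has memory for.
    pulled = []
    while free_slots > 0 and not staging.empty():
        pulled.append(staging.get())
        free_slots -= 1
    return pulled

prefill_finish("req-1", b"...")
prefill_finish("req-2", b"...")
print(len(decode_pull(free_slots=1)))  # 1: bounded by decoder memory, not sender
```

The design choice matters under bursty load: with push, a backed-up decoder accumulates caches it cannot yet use; with pull, the backlog stays on the prefill side where admission control can react.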
Original Abstract
DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interferences. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4x more requests or 12.6x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for >90% of requests.