Efficient Memory Management for Large Language Model Serving with PagedAttention
TL;DR Highlight
The vLLM paper applying OS virtual memory techniques to LLM serving — eliminating KV cache memory waste and boosting throughput 2-4x.
Who Should Read
ML engineers running LLM API servers or considering self-hosting — especially those whose batch sizes are capped by GPU memory.
Core Mechanics
- Existing systems pre-allocate KV cache (transformer's token state memory) in contiguous memory at maximum sequence length — actual utilization was only 20-38%
- PagedAttention divides KV cache into fixed-size blocks stored in non-contiguous memory (like OS paging) — nearly eliminating internal and external fragmentation
- Multiple requests sharing the same prompt (parallel sampling, beam search) physically share KV cache blocks with copy-on-write branching — up to 55% memory savings for beam search
- vLLM achieves up to 22x higher throughput than FasterTransformer, 2-4x over Orca (at same latency)
- Two preemption strategies when GPU memory is full: swap to CPU RAM or recompute KV cache
- Supports major models (GPT, OPT, LLaMA) with OpenAI API-compatible interface — drop-in ready
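The block-table and copy-on-write mechanics above can be sketched in plain Python. This is a toy illustration under assumed names (`BlockManager`, `share`, `write` are hypothetical), not vLLM's actual implementation; a real system would also copy the underlying KV data on a write.

```python
# Toy sketch of PagedAttention-style block management (illustration only):
# logical KV blocks map to non-contiguous physical blocks, and shared blocks
# use reference counts for copy-on-write.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.ref_count = {}  # physical block id -> number of sequences using it

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # A forked sequence (parallel sampling / beam search) reuses the block.
        self.ref_count[block] += 1
        return block

    def write(self, seq_table, logical_idx):
        # Copy-on-write: if the block is shared, give this sequence its own copy.
        block = seq_table[logical_idx]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            new_block = self.allocate()
            seq_table[logical_idx] = new_block
            return new_block
        return block

mgr = BlockManager(num_physical_blocks=8)
parent = [mgr.allocate(), mgr.allocate()]   # prompt spans 2 blocks
child = [mgr.share(b) for b in parent]      # fork shares all prompt blocks
mgr.write(child, 1)                         # child's write triggers a copy
print(parent, child)  # the two tables now diverge only in the last block
```

After the fork, both sequences reference the same physical prompt blocks; only when the child writes into a shared block does it get a private copy, which is what makes parallel sampling and beam search memory-cheap.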
Evidence
- ShareGPT dataset: vLLM sustains 1.7-2.7x higher request rates than Orca (Oracle) and 2.7-8x higher than Orca (Max)
- OPT-13B: vLLM processes 2.2x more requests than Orca (Oracle) and 4.3x more than Orca (Max), with average batch size growing from 7 to 30.42
- Beam search (width=6): KV cache block sharing saves 37.6-55.2% memory; parallel sampling saves 6.1-9.8%
- The PagedAttention kernel itself runs 20-26% slower than FasterTransformer's, but the much larger batch sizes it enables make end-to-end throughput far higher
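The low utilization numbers above follow directly from reserving contiguous memory at the maximum sequence length. A back-of-envelope sketch (the sequence lengths here are assumptions for illustration, not measurements from the paper):

```python
# Why contiguous pre-allocation wastes KV cache memory, in rough numbers.
max_seq_len = 2048   # slots reserved per request by a contiguous allocator (assumed)
actual_len = 600     # tokens a typical request actually produces (assumed)
block_size = 16      # tokens per PagedAttention block

# Contiguous pre-allocation: max_seq_len slots are reserved regardless of need.
contiguous_util = actual_len / max_seq_len

# Paged allocation: waste is at most block_size - 1 slots, in the last block.
blocks_needed = -(-actual_len // block_size)  # ceiling division
paged_util = actual_len / (blocks_needed * block_size)

print(f"contiguous utilization: {contiguous_util:.0%}")  # 29%
print(f"paged utilization:      {paged_util:.0%}")       # 99%
```

With paging, per-request waste is bounded by one partially filled block instead of the gap to the maximum length, which is why utilization approaches 100% regardless of how long each request actually runs.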
How to Apply
- Run pip install vllm and launch the OpenAI API-compatible server; existing GPT API client code works after changing only the endpoint
- When multiple requests share the same prefix (like system prompts), use vLLM's shared prefix feature to reuse prefix KV cache without recomputation — up to 3.58x throughput improvement for few-shot prompt serving
- When using parallel sampling (n>1) or beam search, vLLM's advantage over existing systems grows — especially beneficial for code assistant services generating multiple candidates simultaneously
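The shared-prefix idea from the bullets above can be sketched in plain Python. This is a hypothetical cache keyed on the prefix tokens, purely for illustration; vLLM performs this reuse internally at the KV-block level rather than through an API like this.

```python
# Minimal sketch of shared-prefix caching (illustrative only; names are
# hypothetical and this is not vLLM's API).
compute_calls = 0

def compute_kv(tokens):
    # Stand-in for the expensive attention KV computation over a token span.
    global compute_calls
    compute_calls += 1
    return [f"kv({t})" for t in tokens]

prefix_cache = {}

def get_prompt_kv(system_prefix, user_suffix):
    # The shared system prompt is computed once, then reused across requests.
    key = tuple(system_prefix)
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(system_prefix)
    return prefix_cache[key] + compute_kv(user_suffix)

sys_prompt = ["You", "are", "a", "helpful", "assistant."]
get_prompt_kv(sys_prompt, ["Question", "1"])
get_prompt_kv(sys_prompt, ["Question", "2"])
print(compute_calls)  # 3: two suffixes plus one shared-prefix computation
```

The second request skips recomputing the system prompt entirely, which is the effect behind the reported speedup for few-shot prompt serving where the shared prefix dominates the prompt length.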
Code Example
```bash
# Install vLLM and start an OpenAI API-compatible server
pip install vllm
# Start the server (existing OpenAI clients can connect directly)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 1
```

```python
# Client code (only the endpoint changes)
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Parallel sampling: candidates share the prompt's KV cache blocks
response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[{"role": "user", "content": "Implement the Fibonacci sequence in Python"}],
    n=4,  # generate 4 candidates simultaneously, sharing the prompt KV cache
    temperature=0.8,
    max_tokens=512,
)
for i, choice in enumerate(response.choices):
    print(f"--- Candidate {i+1} ---")
    print(choice.message.content)
```

```python
# Direct usage via the Python API
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-13b")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
prompts = [
    "Translate: Hello, how are you?",
    "Translate: What is your name?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Original Abstract
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2--4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.