Efficient Memory Management for Large Language Model Serving with PagedAttention
TL;DR Highlight
The vLLM paper applying OS virtual memory techniques to LLM serving — eliminating KV cache memory waste and boosting throughput 2-4x.
Who Should Read
ML engineers running LLM API servers or considering self-hosting — especially those whose batch sizes are capped by GPU memory.
Core Mechanics
- Existing systems pre-allocate KV cache (transformer's token state memory) in contiguous memory at maximum sequence length — actual utilization was only 20-38%
- PagedAttention divides KV cache into fixed-size blocks stored in non-contiguous memory (like OS paging) — nearly eliminating internal and external fragmentation
- Multiple requests sharing the same prompt (parallel sampling, beam search) physically share KV cache blocks with copy-on-write branching — up to 55% memory savings for beam search
- vLLM achieves up to 22x higher throughput than FasterTransformer, 2-4x over Orca (at same latency)
- Two preemption strategies when GPU memory is full: swap to CPU RAM or recompute KV cache
- Supports major models (GPT, OPT, LLaMA) with OpenAI API-compatible interface — drop-in ready
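The block-table and copy-on-write mechanics above can be sketched in plain Python. This is a toy illustration under assumed names (`BlockManager`, `share`, `write` are hypothetical), not vLLM's actual implementation; a real system would also copy the underlying KV data on a write.

```python
# Toy sketch of PagedAttention-style block management (illustration only):
# logical KV blocks map to non-contiguous physical blocks, and shared blocks
# use reference counts for copy-on-write.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.ref_count = {}  # physical block id -> number of sequences using it

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # A forked sequence (parallel sampling / beam search) reuses the block.
        self.ref_count[block] += 1
        return block

    def write(self, seq_table, logical_idx):
        # Copy-on-write: if the block is shared, give this sequence its own copy.
        block = seq_table[logical_idx]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            new_block = self.allocate()
            seq_table[logical_idx] = new_block
            return new_block
        return block

mgr = BlockManager(num_physical_blocks=8)
parent = [mgr.allocate(), mgr.allocate()]   # prompt spans 2 blocks
child = [mgr.share(b) for b in parent]      # fork shares all prompt blocks
mgr.write(child, 1)                         # child's write triggers a copy
print(parent, child)  # the two tables now diverge only in the last block
```

After the fork, both sequences reference the same physical prompt blocks; only when the child writes into a shared block does it get a private copy, which is what makes parallel sampling and beam search memory-cheap.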
Evidence
- ShareGPT dataset: vLLM sustains 1.7-2.7x higher request rates than Orca (Oracle) and 2.7-8x higher than Orca (Max)
- OPT-13B: vLLM processes 2.2x more requests than Orca (Oracle) and 4.3x more than Orca (Max), with average batch size growing from 7 to 30.42
- Beam search (width=6): KV cache block sharing saves 37.6-55.2% memory; parallel sampling saves 6.1-9.8%
- The PagedAttention kernel itself runs 20-26% slower than FasterTransformer's, but the much larger batch sizes it enables make end-to-end throughput far higher
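The low utilization numbers above follow directly from reserving contiguous memory at the maximum sequence length. A back-of-envelope sketch (the sequence lengths here are assumptions for illustration, not measurements from the paper):

```python
# Why contiguous pre-allocation wastes KV cache memory, in rough numbers.
max_seq_len = 2048   # slots reserved per request by a contiguous allocator (assumed)
actual_len = 600     # tokens a typical request actually produces (assumed)
block_size = 16      # tokens per PagedAttention block

# Contiguous pre-allocation: max_seq_len slots are reserved regardless of need.
contiguous_util = actual_len / max_seq_len

# Paged allocation: waste is at most block_size - 1 slots, in the last block.
blocks_needed = -(-actual_len // block_size)  # ceiling division
paged_util = actual_len / (blocks_needed * block_size)

print(f"contiguous utilization: {contiguous_util:.0%}")  # 29%
print(f"paged utilization:      {paged_util:.0%}")       # 99%
```

With paging, per-request waste is bounded by one partially filled block instead of the gap to the maximum length, which is why utilization approaches 100% regardless of how long each request actually runs.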
How to Apply
- Run pip install vllm and launch the OpenAI API-compatible server; existing GPT API client code works after changing only the endpoint
- When multiple requests share the same prefix (like system prompts), use vLLM's shared prefix feature to reuse prefix KV cache without recomputation — up to 3.58x throughput improvement for few-shot prompt serving
- When using parallel sampling (n>1) or beam search, vLLM's advantage over existing systems grows — especially beneficial for code assistant services generating multiple candidates simultaneously
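The shared-prefix idea from the bullets above can be sketched in plain Python. This is a hypothetical cache keyed on the prefix tokens, purely for illustration; vLLM performs this reuse internally at the KV-block level rather than through an API like this.

```python
# Minimal sketch of shared-prefix caching (illustrative only; names are
# hypothetical and this is not vLLM's API).
compute_calls = 0

def compute_kv(tokens):
    # Stand-in for the expensive attention KV computation over a token span.
    global compute_calls
    compute_calls += 1
    return [f"kv({t})" for t in tokens]

prefix_cache = {}

def get_prompt_kv(system_prefix, user_suffix):
    # The shared system prompt is computed once, then reused across requests.
    key = tuple(system_prefix)
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(system_prefix)
    return prefix_cache[key] + compute_kv(user_suffix)

sys_prompt = ["You", "are", "a", "helpful", "assistant."]
get_prompt_kv(sys_prompt, ["Question", "1"])
get_prompt_kv(sys_prompt, ["Question", "2"])
print(compute_calls)  # 3: two suffixes plus one shared-prefix computation
```

The second request skips recomputing the system prompt entirely, which is the effect behind the reported speedup for few-shot prompt serving where the shared prefix dominates the prompt length.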
Code Example
```bash
# Install vLLM and start an OpenAI API-compatible server
pip install vllm
# Start the server (existing OpenAI clients can connect directly)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 1
```

```python
# Client code (only the endpoint changes)
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Parallel sampling: candidates share the prompt's KV cache blocks
response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[{"role": "user", "content": "Implement the Fibonacci sequence in Python"}],
    n=4,  # generate 4 candidates simultaneously, sharing the prompt KV cache
    temperature=0.8,
    max_tokens=512,
)
for i, choice in enumerate(response.choices):
    print(f"--- Candidate {i+1} ---")
    print(choice.message.content)
```

```python
# Direct usage via the Python API
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-13b")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
prompts = [
    "Translate: Hello, how are you?",
    "Translate: What is your name?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Original Abstract
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2--4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm.