KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
TL;DR Highlight
Eliminates the waste of re-encoding the same documents for every query in RAG by precomputing and reusing document-level KV caches, with minimal accuracy loss.
Who Should Read
Backend/ML engineers wanting to reduce inference costs and latency when the same documents are reused across multiple queries in RAG pipelines. Developers optimizing LLM serving infrastructure or improving Time-to-First-Token.
Core Mechanics
- Precomputes and stores KV caches for each document independently, then loads and concatenates them at query time to eliminate redundant encoding
- Solves positional information mismatch in independent encoding by re-applying RoPE (Rotary Position Embedding) — remove positions at storage time, recompute with actual positions at inference
- Inserts learnable special 'link tokens' between documents to restore the attention connectivity lost when documents are encoded independently
- Outperforms best existing method (BlockAttention) by avg 4%+ across Llama-3.2-1B/3B and Llama-3.1-8B
- Storing KV caches on CPU and loading to GPU reduces TTFT by up to 96% vs standard decoding
- Compatible with KV cache compression techniques such as LLMLingua and AnLLMs for additional storage reduction
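The RoPE trick in the second bullet works because rotary embeddings are position-additive: a key cached without any positional rotation can later be rotated to whatever global position it ends up at after concatenation. A minimal NumPy sketch of this property (not the paper's code; `rope_rotate` is an illustrative helper):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE rotation to vector x (even head dim) at position pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # rotate each (even, odd) pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Offline: cache the key with positions removed (i.e., stored at position 0)
key = np.random.default_rng(0).standard_normal(64)
cached = rope_rotate(key, pos=0)   # identity rotation

# Online: reapply RoPE at the document's actual global offset
repositioned = rope_rotate(cached, pos=137)

# Equivalent to encoding the key at position 137 from scratch
assert np.allclose(repositioned, rope_rotate(key, 137))
```

Because the rotation angles compose additively, "remove at storage time, recompute at inference" costs only one cheap elementwise rotation per cached key, not a forward pass.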
Evidence
- NaturalQuestions: 6.6% improvement over BlockAttention; HotpotQA: 7.3% improvement (Llama-3.2-1B)
- At 5,000 context tokens: TTFT standard decoding 0.885s vs KVLink 0.027s (96% reduction)
- Prior methods (PromptCache, CacheBlend) reported accuracy drops of up to 35%; KVLink5 stays within an average of 1-2 percentage points of the full-encoding upper bound
- Llama-3.1-8B on A100: standard decoding costs ~$440 per 1M requests vs ~$16 with KVLink (~27x savings)
How to Apply
- In RAG systems: precompute KV caches for each knowledge base document independently, store them, then at query time load only relevant document caches and concatenate.
- The link-token count (0/1/5) sets the accuracy-speed tradeoff: use KVLink0 when speed matters most, KVLink5 when accuracy does.
- If storage burden is high: first compress documents with LLMLingua (50-75%) and cache only the compressed KV, or combine with an LRU/LFU strategy to cache only frequently used documents.
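The LRU strategy in the last bullet can be sketched as a small store that evicts the least recently used document cache when capacity is exceeded; the names here (`KVCacheStore`, `put`, `get`) are illustrative, not from the paper:

```python
from collections import OrderedDict

class KVCacheStore:
    """LRU store for per-document KV caches (hypothetical sketch).

    Keeps at most `capacity` document caches in CPU memory;
    the least recently used entry is evicted first.
    """
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()

    def put(self, doc_id, kv_cache):
        self._store[doc_id] = kv_cache
        self._store.move_to_end(doc_id)          # mark as most recently used
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)      # evict LRU entry

    def get(self, doc_id):
        if doc_id not in self._store:
            return None                          # miss: caller re-encodes the doc
        self._store.move_to_end(doc_id)
        return self._store[doc_id]

# Usage: capacity of 2 to show eviction
store = KVCacheStore(capacity=2)
store.put("doc_a", "kv_a")
store.put("doc_b", "kv_b")
store.get("doc_a")            # doc_a becomes most recently used
store.put("doc_c", "kv_c")    # evicts doc_b, the LRU entry
```

On a miss the serving layer would fall back to encoding the document from scratch and re-inserting its cache, so eviction only costs latency, never correctness.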
Code Example
# KVLink application flow (conceptual code)

# 1. Offline: precompute a KV cache per document
for doc in knowledge_base:
    # Encode the document standalone (strip positional embeddings before saving)
    kv_cache = model.encode_document(doc, remove_position_embedding=True)
    cache_store[doc.id] = kv_cache  # store in CPU memory or on disk

# 2. Online inference: load caches -> reapply positions -> compute link tokens
def kvlink_inference(query, retrieved_doc_ids):
    # Load the precomputed KV caches
    cached_kvs = [cache_store[doc_id] for doc_id in retrieved_doc_ids]
    # Reapply positional embeddings, aligned to each document's actual global position
    repositioned_kvs = reapply_rope(cached_kvs)
    # Compute link-token KVs (the small amount of work done at runtime)
    link_token_kvs = model.compute_link_tokens(repositioned_kvs, num_link_tokens=5)
    # Final KV = [repositioned document caches] + [link-token KVs]
    full_kv = concat(repositioned_kvs, link_token_kvs)
    # Only the query is encoded fresh
    return model.generate(query, kv_cache=full_kv)
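The `reapply_rope` step above needs each document's global start position after concatenation. A minimal sketch of that offset bookkeeping (a hypothetical helper, assuming the token length of each cache and any prompt prefix are known):

```python
def global_offsets(doc_lengths, prefix_len=0):
    """Cumulative start position of each document after concatenation.

    prefix_len accounts for any system-prompt tokens that precede
    the documents in the final sequence.
    """
    offsets, pos = [], prefix_len
    for n in doc_lengths:
        offsets.append(pos)
        pos += n
    return offsets

# Three retrieved docs of 120, 80, and 200 tokens after a 16-token prompt:
print(global_offsets([120, 80, 200], prefix_len=16))  # [16, 136, 216]
```

Each cached document's keys are then rotated by its own offset, so the concatenated cache looks to the model as if the documents had been encoded together in one sequence.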
Original Abstract
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.