KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse
TL;DR Highlight
Eliminates the waste of re-encoding the same documents for every query in RAG by precomputing and reusing document-level KV caches, with minimal accuracy loss.
Who Should Read
Backend/ML engineers wanting to reduce inference costs and latency when the same documents are reused across multiple queries in RAG pipelines. Developers optimizing LLM serving infrastructure or improving Time-to-First-Token.
Core Mechanics
- Precomputes and stores KV caches for each document independently, then loads and concatenates them at query time to eliminate redundant encoding
- Solves positional information mismatch in independent encoding by re-applying RoPE (Rotary Position Embedding) — remove positions at storage time, recompute with actual positions at inference
- Inserts learnable special 'link tokens' between documents to restore the attention connectivity lost when documents are encoded independently
- Outperforms best existing method (BlockAttention) by avg 4%+ across Llama-3.2-1B/3B and Llama-3.1-8B
- Storing KV caches on CPU and loading to GPU reduces TTFT by up to 96% vs standard decoding
- Compatible with KV cache compression techniques such as LLMLingua and AnLLMs for additional storage reduction
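The RoPE trick in the second bullet works because rotary embeddings are position-additive: a key cached without any positional rotation can later be rotated to whatever global position it ends up at after concatenation. A minimal NumPy sketch of this property (not the paper's code; `rope_rotate` is an illustrative helper):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE rotation to vector x (even head dim) at position pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # rotate each (even, odd) pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Offline: cache the key with positions removed (i.e., stored at position 0)
key = np.random.default_rng(0).standard_normal(64)
cached = rope_rotate(key, pos=0)   # identity rotation

# Online: reapply RoPE at the document's actual global offset
repositioned = rope_rotate(cached, pos=137)

# Equivalent to encoding the key at position 137 from scratch
assert np.allclose(repositioned, rope_rotate(key, 137))
```

Because the rotation angles compose additively, "remove at storage time, recompute at inference" costs only one cheap elementwise rotation per cached key, not a forward pass.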
Evidence
- NaturalQuestions: 6.6% improvement over BlockAttention; HotpotQA: 7.3% improvement (Llama-3.2-1B)
- At 5,000 context tokens: TTFT standard decoding 0.885s vs KVLink 0.027s (96% reduction)
- Prior methods (PromptCache, CacheBlend) reported accuracy drops of up to 35%; KVLink5 stays within an average of 1-2 percentage points of the full-encoding upper bound
- Llama-3.1-8B on A100: standard decoding costs ~$440 per 1M requests vs ~$16 with KVLink (~27x savings)
How to Apply
- In RAG systems: precompute KV caches for each knowledge base document independently, store them, then at query time load only relevant document caches and concatenate.
- The link-token count (0/1/5) sets the accuracy-speed tradeoff: use KVLink0 when speed matters most, KVLink5 when accuracy does.
- If storage burden is high: first compress documents with LLMLingua (50-75%) and cache only the compressed KV, or combine with an LRU/LFU strategy to cache only frequently used documents.
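The LRU strategy in the last bullet can be sketched as a small store that evicts the least recently used document cache when capacity is exceeded; the names here (`KVCacheStore`, `put`, `get`) are illustrative, not from the paper:

```python
from collections import OrderedDict

class KVCacheStore:
    """LRU store for per-document KV caches (hypothetical sketch).

    Keeps at most `capacity` document caches in CPU memory;
    the least recently used entry is evicted first.
    """
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()

    def put(self, doc_id, kv_cache):
        self._store[doc_id] = kv_cache
        self._store.move_to_end(doc_id)          # mark as most recently used
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)      # evict LRU entry

    def get(self, doc_id):
        if doc_id not in self._store:
            return None                          # miss: caller re-encodes the doc
        self._store.move_to_end(doc_id)
        return self._store[doc_id]

# Usage: capacity of 2 to show eviction
store = KVCacheStore(capacity=2)
store.put("doc_a", "kv_a")
store.put("doc_b", "kv_b")
store.get("doc_a")            # doc_a becomes most recently used
store.put("doc_c", "kv_c")    # evicts doc_b, the LRU entry
```

On a miss the serving layer would fall back to encoding the document from scratch and re-inserting its cache, so eviction only costs latency, never correctness.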
Code Example
# KVLink application flow (conceptual code)

# 1. Offline: precompute a KV cache per document
for doc in knowledge_base:
    # Encode the document standalone (strip positional embeddings before saving)
    kv_cache = model.encode_document(doc, remove_position_embedding=True)
    cache_store[doc.id] = kv_cache  # store in CPU memory or on disk

# 2. Online inference: load caches -> reapply positions -> compute link tokens
def kvlink_inference(query, retrieved_doc_ids):
    # Load the precomputed KV caches
    cached_kvs = [cache_store[doc_id] for doc_id in retrieved_doc_ids]
    # Reapply positional embeddings, aligned to each document's actual global position
    repositioned_kvs = reapply_rope(cached_kvs)
    # Compute link-token KVs (the small amount of work done at runtime)
    link_token_kvs = model.compute_link_tokens(repositioned_kvs, num_link_tokens=5)
    # Final KV = [repositioned document caches] + [link-token KVs]
    full_kv = concat(repositioned_kvs, link_token_kvs)
    # Only the query is encoded fresh
    return model.generate(query, kv_cache=full_kv)
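The `reapply_rope` step above needs each document's global start position after concatenation. A minimal sketch of that offset bookkeeping (a hypothetical helper, assuming the token length of each cache and any prompt prefix are known):

```python
def global_offsets(doc_lengths, prefix_len=0):
    """Cumulative start position of each document after concatenation.

    prefix_len accounts for any system-prompt tokens that precede
    the documents in the final sequence.
    """
    offsets, pos = [], prefix_len
    for n in doc_lengths:
        offsets.append(pos)
        pos += n
    return offsets

# Three retrieved docs of 120, 80, and 200 tokens after a 16-token prompt:
print(global_offsets([120, 80, 200], prefix_len=16))  # [16, 136, 216]
```

Each cached document's keys are then rotated by its own offset, so the concatenated cache looks to the model as if the documents had been encoded together in one sequence.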
Original Abstract
We describe KVLink, an approach for efficient key-value (KV) cache reuse in large language models (LLMs). In many LLM applications, different inputs can share overlapping context, such as the same retrieved document appearing in multiple queries. However, the LLMs still need to encode the entire context for each query, leading to redundant computation. In this paper, we investigate a new strategy to eliminate such inefficiency, where the KV cache of each document is precomputed independently. During inference, the KV caches of retrieved documents are concatenated, allowing the model to reuse cached representations instead of recomputing them. To mitigate the performance degradation when using KV caches computed independently for each document, KVLink introduces two key techniques: adjusting positional embeddings of the KV cache at inference to match the global position after concatenation, and using trainable special tokens to restore self-attention across independently encoded documents. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods. Furthermore, by leveraging precomputed KV caches, our approach reduces time-to-first-token by up to 96% compared to standard LLM inference, making it a scalable and efficient solution for context reuse. Additionally, KVLink can be combined with KV cache compression to further save cache loading and storage overhead while outperforming the baselines.