TurboQuant: Redefining AI efficiency with extreme compression
TL;DR Highlight
Google Research's two-stage vector compression (PolarQuant + QJL) achieves a 6x KV cache reduction with zero reported accuracy loss and up to an 8x attention speedup on H100 GPUs
Who Should Read
ML engineers optimizing LLM inference cost and latency; teams battling KV cache memory bottlenecks in long-context services
Core Mechanics
- PolarQuant: randomly rotates vectors, then converts them to polar coordinates, mapping angles onto a fixed circular grid — eliminating the storage overhead of per-block quantization constants
- QJL (Quantized Johnson-Lindenstrauss): reduces each vector coordinate to a single sign bit (+1/-1), using just 1 additional bit to detect and correct residual compression error
- 6x KV cache size reduction, 3-bit quantization, zero training required — zero accuracy loss across all benchmarks
- Up to 8x performance increase for attention computation on H100 GPUs vs 32-bit unquantized baseline
- Superior recall vs PQ and RaBitQ in vector search — applicable to large-scale ANN search as well
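To make stage 1 concrete, here is a toy sketch of the rotate-then-polar-quantize idea: apply a random rotation, pair up coordinates, and snap each pair's angle onto a fixed 3-bit circular grid. This simplified version still stores a per-pair radius, which the actual PolarQuant scheme is claimed to avoid; all function names and the bit budget here are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Orthonormal rotation via QR of a Gaussian matrix (a common construction;
    # the paper may use a faster structured transform instead).
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def polar_quantize(v, rot, angle_bits=3):
    # Rotate, pair up coordinates, store each pair as a quantized angle
    # on a fixed circular grid plus the pair's radius (a simplification).
    w = rot @ v
    pairs = w.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in [-pi, pi]
    levels = 2 ** angle_bits
    code = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    return code.astype(np.uint8), r

def polar_dequantize(code, r, rot, angle_bits=3):
    levels = 2 ** angle_bits
    theta = code / levels * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return rot.T @ pairs.reshape(-1)  # undo the rotation

d = 64
v = rng.normal(size=d)
rot = random_rotation(d)
code, r = polar_quantize(v, rot)
v_hat = polar_dequantize(code, r, rot)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative reconstruction error: {rel_err:.3f}")
```

With 3 bits per angle the maximum angular error per pair is pi/8, so the reconstruction lands close to the original vector; the random rotation spreads energy evenly across pairs so no single coordinate dominates the error.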
Evidence
- H100 GPU benchmark: up to 8x attention computation speedup vs 32-bit unquantized, zero accuracy loss across all downstream benchmarks
- Independent implementations were published immediately for llama.cpp and PyTorch (github.com/mudler/llama.cpp, github.com/tonbistudio/turboquant-pytorch)
How to Apply
- Apply TurboQuant to long-context LLM services where KV cache memory is the bottleneck — expect a 6x memory reduction and up to an 8x attention speedup
- Experiment immediately with the llama.cpp integration — no training required, drop-in for existing models
- Also applicable to vector DB (ANN search) performance improvement — better recall than PQ
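A back-of-envelope calculation shows why the 6x reduction matters for long-context serving. The model dimensions below (32 layers, 32 heads, head dim 128, a 7B-class shape) are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(tokens, layers=32, heads=32, head_dim=128, bits=16):
    # Factor of 2 accounts for storing both keys and values per token.
    return 2 * layers * heads * head_dim * tokens * bits / 8

ctx = 128_000                      # a long-context request
fp16 = kv_cache_bytes(ctx)         # unquantized 16-bit cache
compressed = fp16 / 6              # article's claimed ~6x reduction
print(f"fp16 KV cache:  {fp16 / 2**30:.1f} GiB")
print(f"~6x compressed: {compressed / 2**30:.1f} GiB")
```

At these dimensions a single 128k-token request drops from roughly 62.5 GiB of cache to about 10.4 GiB — the difference between spilling across GPUs and fitting on one.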
Terminology
PolarQuant — TurboQuant Stage 1: converts rotated vectors to polar coordinates and maps angles to a fixed circular grid, eliminating quantization-constant storage overhead
QJL (Quantized Johnson-Lindenstrauss) — TurboQuant Stage 2: detects and corrects compression errors using just 1 additional bit, maintaining attention score accuracy
KV Cache (Key-Value Cache) — memory storing attention keys/values from previous tokens during transformer inference; grows linearly with context length, quickly becoming the dominant memory cost in long-context serving
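The intuition behind QJL's sign bits — that 1-bit codes of random projections preserve similarity — can be illustrated with classical sign random projections (SimHash). This is not QJL's actual estimator, only a minimal sketch of why sign bits retain enough information to approximate angles and inner products; the sketch dimension `m` is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 64, 4096                 # m random projections (sketch dimension)
P = rng.normal(size=(m, d))

def sign_sketch(v):
    # Keep only the sign of each random projection: 1 bit per projection.
    return np.sign(P @ v)

def est_cosine(sa, sb):
    # The fraction of agreeing sign bits estimates the angle between
    # the original vectors: P(agree) = 1 - theta / pi.
    agree = np.mean(sa == sb)
    theta = np.pi * (1 - agree)
    return np.cos(theta)

a = rng.normal(size=d)
b = a + 0.5 * rng.normal(size=d)   # a noisy, correlated second vector
true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
est = est_cosine(sign_sketch(a), sign_sketch(b))
print(f"true cosine {true_cos:.2f}, 1-bit estimate {est:.2f}")
```

Even with every coordinate collapsed to a sign, the estimate tracks the true cosine similarity closely once enough projections are used — the property that lets sign-bit codes preserve attention-score accuracy.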