TurboQuant: Redefining AI efficiency with extreme compression
TL;DR Highlight
Google Research's two-stage vector compression scheme (PolarQuant + QJL) achieves 6x KV cache reduction with zero accuracy loss and up to 8x attention speedup on H100 GPUs
Who Should Read
ML engineers optimizing LLM inference cost and latency; teams battling KV cache memory bottlenecks in long-context services
Core Mechanics
- PolarQuant: randomly rotates vectors then converts to polar coordinates, mapping angle patterns onto a fixed circular grid — eliminates quantization constant storage overhead entirely
- QJL (Quantized Johnson-Lindenstrauss): reduces each vector coordinate to a single sign bit (+1/-1), with just 1 additional bit used to detect and correct remaining compression error (both stages are sketched after this list)
- 6x KV cache size reduction, 3-bit quantization, zero training required — zero accuracy loss across all benchmarks
- Up to 8x performance increase for attention computation on H100 GPUs vs 32-bit unquantized baseline
- Superior recall to PQ and RaBitQ in vector search, making it applicable to large-scale ANN search as well
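
A minimal NumPy sketch of the two stages under illustrative assumptions (the function names, the 2-D coordinate pairing, and the 3-bit angle grid are my stand-ins, not the published implementation): stage one rotates a vector and quantizes each coordinate pair's angle onto a fixed circular grid, and stage two keeps only the sign bits of a random JL projection plus the vector's norm, from which dot products can be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polarquant_encode(x, R, angle_bits=3):
    """Stage 1 sketch: rotate, pair up coordinates, quantize each pair's angle
    onto a fixed circular grid (no per-vector quantization constants)."""
    z = R @ x
    pairs = z.reshape(-1, 2)                      # (d/2, 2)
    radius = np.linalg.norm(pairs, axis=1)        # kept here; could be coded separately
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])  # in [-pi, pi]
    levels = 2 ** angle_bits
    code = np.round((angle + np.pi) / (2 * np.pi) * levels).astype(int) % levels
    return code, radius

def polarquant_decode(code, radius, R, angle_bits=3):
    levels = 2 ** angle_bits
    angle = code / levels * 2 * np.pi - np.pi
    pairs = radius[:, None] * np.stack([np.cos(angle), np.sin(angle)], axis=1)
    return R.T @ pairs.reshape(-1)

def qjl_encode(k, S):
    """Stage 2 sketch (QJL idea): keep only sign bits of a JL projection plus the norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_dot(q, sign_bits, k_norm, S):
    # Estimate <q, k> from the key's sign bits and the full query projection.
    m = S.shape[0]
    return np.sqrt(np.pi / 2) / m * k_norm * np.dot(sign_bits, S @ q)

d, m = 64, 256
R, S = random_rotation(d), rng.standard_normal((m, d))
k, q = rng.standard_normal(d), rng.standard_normal(d)
code, radius = polarquant_encode(k, R)
print("reconstruction error:", np.linalg.norm(polarquant_decode(code, radius, R) - k))
print("true vs estimated dot:", q @ k, qjl_dot(q, *qjl_encode(k, S), S))
```

The sign-bit representation is the kind of structure a fused attention kernel can exploit, since key-query scores can be approximated from packed bits rather than full-precision vectors.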
Evidence
- H100 GPU benchmark: up to 8x attention computation speedup vs 32-bit unquantized, zero accuracy loss across all downstream benchmarks
- Independent llama.cpp and PyTorch implementations appeared almost immediately (github.com/mudler/llama.cpp, github.com/tonbistudio/turboquant-pytorch)
How to Apply
- Apply TurboQuant to long-context LLM services where KV cache memory is the bottleneck; expect roughly 6x memory reduction and up to 8x attention speedup (a placement sketch follows this list)
- Experiment immediately with the llama.cpp integration — no training required, drop-in for existing models
- Also applicable to vector databases (ANN search), where it delivers better recall than PQ
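
A minimal PyTorch placement sketch, assuming hypothetical quantize_kv / dequantize_kv helpers (here a naive uniform 3-bit round-trip, not TurboQuant's codec or the llama.cpp integration's API): new keys and values are compressed on the way into the cache and decoded only inside the attention step.

```python
import torch

def quantize_kv(x, bits=3):
    # Naive stand-in: uniform per-tensor quantization to 2**bits levels.
    # A TurboQuant-style codec (rotation + polar/sign coding) would replace this.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    codes = torch.round((x - lo) / scale).to(torch.uint8)
    return codes, lo, scale

def dequantize_kv(codes, lo, scale):
    return codes.float() * scale + lo

def decode_step(q, k_new, v_new, k_cache, v_cache):
    # Compress new K/V entries as they enter the cache.
    k_cache.append(quantize_kv(k_new))
    v_cache.append(quantize_kv(v_new))
    # Decode only inside attention (a fused kernel would do this on the fly).
    K = torch.stack([dequantize_kv(*c) for c in k_cache])  # (seq, d_head)
    V = torch.stack([dequantize_kv(*c) for c in v_cache])  # (seq, d_head)
    attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return attn @ V

d_head = 64
k_cache, v_cache = [], []
for _ in range(8):  # toy autoregressive loop
    q = torch.randn(1, d_head)
    out = decode_step(q, torch.randn(d_head), torch.randn(d_head), k_cache, v_cache)
print(out.shape)  # torch.Size([1, 64])
```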
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels in Swift on Apple Silicon, optimizing step by step across CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to build the core operations of LLM training from scratch without frameworks and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares how it built an SSD-only KV storage engine without fsync, achieving roughly 65% higher write performance under identical conditions. The core design combines preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4 GB Gemini Nano model file without user consent and to re-download it even after deletion. Concerns have been raised about a potential GDPR violation and the environmental cost of rolling this out across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.