Google just shrank vectors of 32-bit floats down to roughly 3 bits per entry and says the answers still come out perfect.
TurboQuant, unveiled today by Google Research, is really a trio of algorithms (TurboQuant, QJL and PolarQuant) that the company claims can compress the memory-hungry key-value (KV) cache inside large language models to one-tenth its usual size while actually speeding up inference. In internal tests on open-source Gemma and Mistral models, the method delivered a 6× smaller KV footprint and up to 8× faster attention-score computation on NVIDIA H100 GPUs, all without retraining or fine-tuning.
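To see why a cut like that matters, here is the back-of-envelope arithmetic. All model dimensions below are illustrative placeholders, not the actual Gemma or Mistral configurations:

```python
# Rough KV-cache sizing. Layer count, KV heads, head dimension and context
# length are made-up round numbers for illustration only.
layers, kv_heads, head_dim = 32, 8, 128
context_tokens = 128_000

# Keys + values, one entry each per (layer, head, head_dim, token).
entries = 2 * layers * kv_heads * head_dim * context_tokens

fp16_bytes = entries * 2        # 16 bits per entry
threebit_bytes = entries * 3 / 8  # ~3 bits per entry

print(f"fp16:  {fp16_bytes / 2**30:.1f} GiB")   # ~15.6 GiB
print(f"3-bit: {threebit_bytes / 2**30:.1f} GiB")  # ~2.9 GiB
```

At 16 versus 3 bits per entry the raw ratio is 16/3 ≈ 5.3×; whatever overhead or extra savings the real system carries would move that figure toward the reported 6×.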
The trick is a two-stage squeeze. First, PolarQuant randomly rotates the vector and slams it through a standard quantizer, burning most of the bit budget on the vector's dominant structure. Second, the leftover error is cleaned up by QJL, a Johnson-Lindenstrauss transform that stores only a single sign bit per projected value and uses a custom estimator to keep dot-product accuracy intact. Because both stages are "data-oblivious," Google says there is zero per-dataset tuning and no extra lookup tables, historically the hidden tax that turns 4-bit compression back into 6-bit reality.
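Google has not published the implementation alongside this announcement, but the two-stage idea can be sketched. The toy NumPy version below is a hedged approximation: the random orthogonal rotation, the uniform coarse quantizer, the sketch dimension `m`, and the sign-bit estimator's √(π/2) rescaling are standard choices from the Johnson-Lindenstrauss literature, not Google's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 1024  # vector dimension and JL sketch dimension (illustrative)

# Stage 1 (PolarQuant-flavoured): random rotation, then a coarse scalar
# quantizer that spends the main bit budget on the rotated vector.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation

def coarse_quantize(x, bits=2):
    """Uniform scalar quantizer over the vector's own value range."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

# Stage 2 (QJL-flavoured): project the residual with a Gaussian JL sketch,
# keeping only one sign bit per projected coordinate plus the residual's norm.
S = rng.standard_normal((m, d))

def qjl_encode(r):
    return np.sign(S @ r), np.linalg.norm(r)

def qjl_dot(qv, sign_bits, norm):
    # E[<S q, sign(S r)>] = m * sqrt(2/pi) * <q, r> / ||r||, hence the rescaling.
    return np.sqrt(np.pi / 2) / m * norm * ((S @ qv) @ sign_bits)

# Compress a key k, then estimate its dot product with a query q.
k, q = rng.standard_normal(d), rng.standard_normal(d)
k_rot = R @ k
k_hat = coarse_quantize(k_rot)                   # coarse main-stage code
sign_bits, res_norm = qjl_encode(k_rot - k_hat)  # sign-bit residual code
est = (R @ q) @ k_hat + qjl_dot(R @ q, sign_bits, res_norm)
print(f"true <q,k> = {q @ k:.2f}, estimate = {est:.2f}")
```

Because the rotation is orthogonal it preserves dot products, so the coarse stage contributes its term exactly and only the sign-bit residual estimate is noisy; both stages draw their randomness independently of the data, which is what "data-oblivious" buys.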
Google tested the stack on five long-context benchmarks (LongBench, Needle-in-a-Haystack, ZeroSCROLLS, RULER and L-Eval) and reports perfect recall on the notorious needle task while trimming the KV cache to 3 bits. Vector-search experiments show TurboQuant beating two state-of-the-art baselines, PQ and RaBitQ, on 1@k recall even though the baselines were allowed dataset-specific codebooks and TurboQuant was not.
The company is blunt about the scope of its claims: every number comes from Google-run benchmarks, and the work is headed to ICLR 2026 (TurboQuant) and AISTATS 2026 (PolarQuant) for peer review. Still, the team insists the algorithms are "provably efficient" and operate "near theoretical lower bounds": in other words, the claim is that the compression quality rests on mathematical guarantees, not just empirical luck.
If the results survive outside labs, the payoff is straightforward: denser KV caches let Google (and anyone else) stuff longer contexts onto the same GPU, or serve the same context with cheaper hardware. For vector search—the backbone of semantic Google Search at billion-scale—that translates into indexes that build faster, query faster and fit in less RAM, all without the usual trade-off dance between recall and memory.
Today it’s a paper and a press release. Tomorrow it needs to survive someone else’s cluster, someone else’s dataset and someone else’s skepticism. Until then, Google’s message is clear: 3 bits is enough—if you believe the math.