Industry · 4 min read · The Register

Google's TurboQuant saves memory, but won't save us from DRAM-pricing hell

Redakcja Pixelift

Photo: The Register

A six-fold reduction in memory requirements during AI model operation is the primary promise of TurboQuant, a new data compression technology from Google. In the face of a three-fold increase in DRAM prices over the past year, the solution emerges as a lifeline for infrastructure budgets, though experts are tempering the enthusiasm: the technology will not lower market component prices, merely allow them to be managed more efficiently.

TurboQuant focuses on optimizing KV caches, the "short-term memory" of LLMs that stores conversation context. Instead of standard 16-bit precision, Google employs a novel approach combining Quantized Johnson-Lindenstrauss (QJL) and PolarQuant. By mapping vectors to a polar coordinate system instead of a Cartesian one, the researchers managed to scale down to just 2.5 bits with minimal loss of quality. In practice, this means attention computations on NVIDIA H100 GPUs can run up to eight times faster.

For inference service providers and model developers, TurboQuant represents a breakthrough in scalability: it allows significantly longer contexts and more parallel queries to be handled on the same hardware. While the technology will not cure the semiconductor market crisis, it drastically pushes the boundary of what can be extracted from the current generation of AI accelerators. Consequently, the race for algorithmic efficiency is becoming just as vital as the struggle for physical silicon.

In a world dominated by rising AI infrastructure costs, every innovation promising a reduction in resource demand is met with immense enthusiasm. When Google unveiled TurboQuant, an AI data compression technology, the industry reacted almost euphorically, seeing it as a rescue from drastic DRAM price hikes; memory prices have tripled over the past year. The reality, however, is more complex: while the technology redefines inference efficiency, it will not necessarily translate into lower component bills.

TurboQuant is an advanced quantization method that tackles one of the most pressing problems of modern Large Language Models (LLMs) — managing the key-value cache, known as KV cache. According to Google researchers, this solution allows for at least a six-fold reduction in memory consumption during the inference process. This is an impressive result, considering that DRAM and NAND memory are currently reaching record prices, and the demand for cloud computing power shows no signs of slowing down.
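A back-of-envelope calculation shows why that six-fold figure matters. The sketch below assumes a hypothetical 70B-class model (80 layers, 8 KV heads, head dimension 128) serving a 128K-token context; none of these figures come from Google's announcement, they are illustrative stand-ins.

```python
# Back-of-envelope KV-cache sizing. The model dimensions used below
# are illustrative assumptions, not a specific published configuration.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    # Factor of 2 covers keys and values; divide by 8 to convert bits to bytes.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bits_per_value / 8

baseline = kv_cache_bytes(80, 8, 128, 128_000, 16)     # BF16 baseline
compressed = kv_cache_bytes(80, 8, 128, 128_000, 2.5)  # ~2.5-bit cache

print(f"BF16 cache:    {baseline / 2**30:.1f} GiB")
print(f"2.5-bit cache: {compressed / 2**30:.1f} GiB "
      f"(ratio {baseline / compressed:.1f}:1)")
```

On these assumptions the cache shrinks from roughly 39 GiB to about 6 GiB per session, which is the difference between fitting one long-context user on a GPU and fitting several.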

Intelligent compression instead of model slimming

Unlike traditional quantization methods, which focus on shrinking the model itself, TurboQuant aims at a different target: the volume of the KV cache, which acts as the AI model's short-term memory. This is where the context of an ongoing conversation is stored, allowing the model to "remember" what was discussed several paragraphs earlier. The problem is that during long sessions this data accumulates rapidly, often taking up more space than the model itself.

By default, the KV cache is stored at 16-bit precision. Reducing that to eight or four bits theoretically cuts memory requirements by a factor of two to four. Google, however, went a step further: TurboQuant achieves quality comparable to the 16-bit BF16 format using just 3.5 bits. In tests on H100 chips, it demonstrated up to an eight-fold speedup at 4-bit precision, a breakthrough in the performance of attention operations.

  • Reduction of memory consumption by a factor of 6:1, achieved through the PolarQuant and QJL techniques.
  • The ability to go down to 2.5 bits with minimal loss in the quality of generated content.
  • An 8-fold increase in performance on NVIDIA H100 GPUs when calculating attention logits.
  • Universal application extending beyond LLMs, including vector databases.
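The "two to four times" baseline mentioned above is what plain round-to-nearest quantization delivers. A minimal sketch of that conventional approach, shown here only as a point of comparison and not as Google's algorithm:

```python
# Conventional symmetric quantization of a single cache vector to b bits.
# This is the baseline technique TurboQuant improves on, not TurboQuant itself.
def quantize(vec, bits):
    scale = max(abs(x) for x in vec) or 1.0   # per-vector scale factor
    levels = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    q = [round(x / scale * levels) for x in vec]
    return q, scale

def dequantize(q, scale, bits):
    levels = 2 ** (bits - 1) - 1
    return [v * scale / levels for v in q]

vec = [0.8, -0.31, 0.05, -0.77]
q, s = quantize(vec, 4)
approx = dequantize(q, s, 4)

# Each value now occupies 4 bits instead of 16 -- a 4x saving --
# at the cost of a rounding error per element:
err = max(abs(a - b) for a, b in zip(vec, approx))
```

The rounding error is the catch: pushing this naive scheme below four bits degrades attention quality quickly, which is precisely the barrier the PolarQuant/QJL combination described next is designed to clear.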

Mathematics behind the curtain: PolarQuant and QJL

The success of TurboQuant is based on the combination of two innovative mathematical approaches: PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant revolutionizes the way KV cache vectors are stored by mapping them onto a circular grid using polar coordinates instead of the traditional Cartesian system. Instead of describing a point's position via X and Y axes, the system operates with radius and angle, which eliminates the need for costly data normalization.

The use of polar coordinates means that each vector refers to a common reference point, which drastically reduces memory overhead. Precision is maintained by the QJL mechanism, which corrects errors occurring during the first phase of compression. Thanks to this, the model maintains high accuracy of the so-called attention scores, which determine which fragments of context are crucial for providing the correct answer to a user's query.

"This is comparable to replacing the instruction 'go 3 blocks east and 4 blocks north' with the command 'go a total of 5 blocks at a 37-degree angle'" — Google researchers explain in an official statement.

The performance paradox and market realities

Even though TurboQuant offers a spectacular 6:1 compression ratio, investors' hopes for a drop in memory prices may prove futile. While the technology makes AI inference clusters more efficient and cheaper to operate, the history of technology teaches that increased efficiency rarely reduces demand, a pattern known as the Jevons paradox. On the contrary: it enables projects that were previously economically unjustifiable.

Just a year ago, open models like DeepSeek R1 offered context windows ranging from 64 to 256 thousand tokens. Today, models supporting over a million tokens are becoming the standard. In the face of the growing popularity of coding assistants and agentic systems like OpenClaw, AI service providers will likely use the savings from TurboQuant to offer even larger context windows instead of buying fewer DRAM sticks.

Analyses from TrendForce confirm this thesis: TurboQuant may paradoxically drive memory demand by stimulating the development of applications requiring massive context. Instead of a price reduction, we are therefore facing the next phase of an arms race, in which the gained space will be immediately filled with new data. TurboQuant is a powerful tool in the hands of engineers, but in a clash with the market dynamics of DRAM prices, it remains merely a bandage on a deep wound of infrastructure costs.

Source: The Register