
Google says new TurboQuant compression can lower AI memory usage without sacrificing quality


Photo: Google

Eight times the performance with a simultaneous sixfold reduction in memory consumption: these are the results Google Research reported when presenting TurboQuant. The new compression algorithm addresses one of the most pressing problems in generative artificial intelligence: an immense appetite for hardware resources that drives up component prices and limits model capabilities. The key lies in optimizing the so-called key-value cache, the digital "cheat sheet" where large language models store processed data to avoid recomputing it from scratch.

The breakthrough is the application of the PolarQuant technique, which changes how vectors are encoded. Instead of standard Cartesian coordinates, the system converts data into polar coordinates, reducing it to just two pieces of information: a radius and a direction. This allows a drastic reduction in numerical precision (quantization) without the drop in output quality that usually accompanies the process.

For end users, this promises a leap in accessibility: advanced AI models could run faster and more efficiently on weaker hardware, significantly lowering the barrier to entry for creators running local LLM instances. TurboQuant suggests that the key to scaling AI need not be ever more gigabytes of VRAM alone, but smarter mathematics behind information processing.

In the world of generative artificial intelligence, the fight for hardware resources has become the new gold rush. Anyone who has ever tried to run an advanced language model locally or manage server infrastructure knows that the bottleneck isn't always raw processor power, but the drastic demand for RAM and VRAM. Google Research has just presented a solution that could change the rules of the game regarding energy and hardware efficiency. TurboQuant, a new compression algorithm, promises a radical reduction in the memory footprint of LLM models while simultaneously increasing operational speed, which until now almost always involved a painful compromise in the quality of generated responses.

The problem with modern AI models is that their "intelligence" is based on the mathematical representation of meanings in the form of vectors. These high-dimensional structures can contain hundreds or even thousands of embeddings describing everything from the semantic nuances of text to complex visual data. Storing this information in real-time, especially within the so-called key-value cache, generates an enormous load. Google aptly compares this mechanism to a "digital cheat sheet" that allows the model to avoid constantly recalculating the same data, but the price for this convenience is measured in gigabytes of occupied memory.
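The scale of that load is easy to estimate. The sketch below computes a back-of-the-envelope key-value cache footprint for a hypothetical decoder-only transformer; the model dimensions (32 layers, 32 heads, head dimension 128) are illustrative assumptions, not figures from Google's announcement.

```python
# Back-of-the-envelope size of the key-value cache for a decoder-only
# transformer. All model dimensions below are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # Factor of 2: one tensor of keys and one of values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=4096, bytes_per_value=2)  # fp16 values
print(f"{size / 2**30:.1f} GiB")  # → 2.0 GiB
```

Even at a modest 4k-token context, the cache alone occupies gigabytes, and it grows linearly with context length.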

End of the precision and performance dilemma

Until now, the standard approach to model optimization has been quantization. This involves lowering the precision of calculations, for example, from 16-bit floating-point numbers to 8-bit or even 4-bit formats. While this allows powerful models to run on consumer hardware, it carries an inevitable degradation in quality. The model becomes less precise in predicting subsequent tokens, which in practice means more chaotic, less logical, or simply incorrect answers. TurboQuant aims to break this pattern by offering tools that allow the AI's "mind" to stay sharp with significantly lower resource requirements.

TurboQuant algorithm performance charts
Early Google tests show a drastic increase in performance while maintaining data precision.

Early tests conducted by Google engineers yielded impressive results. In selected scenarios, the team achieved an 8-fold performance increase and a 6-fold reduction in memory consumption. Most importantly, these gains came without measurable loss in output quality. This means that the barrier to entry for advanced AI systems can be pushed much lower, allowing for more responsive assistants on mobile devices or in edge computing systems, where the memory budget is always strictly limited.

PolarQuant: Innovation in data geometry

The key to TurboQuant's success is a two-stage compression process in which the PolarQuant system plays the central role. Traditionally, vectors in AI models are encoded using standard Cartesian coordinates (the XYZ system). This is an intuitive representation from a mathematical point of view but an inefficient one under aggressive compression. Google opted for a radical change of perspective, converting the vectors from Cartesian into polar coordinates.

  • Radius: represents the magnitude of the encoded data.
  • Direction: carries the semantic meaning of the information.
  • Efficiency: complex data is reduced to two key components on a circular coordinate grid.
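A toy version of the idea can be sketched as follows: split a vector into 2-D pairs, express each pair as a radius and an angle, and quantize those two quantities on a circular grid. The pairing scheme and the 4-bit widths here are assumptions for illustration, in the spirit of the PolarQuant description above, not Google's published implementation.

```python
import numpy as np

def polar_quantize(v, angle_bits=4, radius_bits=4):
    pairs = v.reshape(-1, 2)                      # group into (x, y) pairs
    r = np.hypot(pairs[:, 0], pairs[:, 1])        # radius
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # direction, in [-pi, pi]
    n_angles = 2 ** angle_bits
    # Angle on a uniform circular grid; -pi and +pi wrap to the same cell.
    theta_q = np.round((theta + np.pi) / (2 * np.pi) * n_angles) % n_angles
    # Radius on a uniform linear grid.
    r_scale = r.max() / (2 ** radius_bits - 1)
    r_q = np.round(r / r_scale)
    return theta_q.astype(np.uint8), r_q.astype(np.uint8), r_scale

def polar_dequantize(theta_q, r_q, r_scale, angle_bits=4):
    theta = theta_q / 2 ** angle_bits * 2 * np.pi - np.pi
    r = r_q * r_scale
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).ravel()

rng = np.random.default_rng(1)
v = rng.normal(size=128).astype(np.float32)
v_hat = polar_dequantize(*polar_quantize(v))
print(f"relative error: {np.linalg.norm(v - v_hat) / np.linalg.norm(v):.2f}")
```

In this toy setting, each pair of fp16 values (32 bits) shrinks to 8 bits: 4 for the angle and 4 for the radius, with the reconstruction error concentrated where it matters least, in low-magnitude components.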

By using a circular grid instead of a rectangular one, PolarQuant can map the essential features of the data far more precisely with fewer bits. It is this geometric innovation that avoids the errors typical of traditional quantization. Instead of bluntly cutting the precision of every number, the algorithm manages how information is represented, focusing on what is crucial for the model to maintain logical consistency.

PolarQuant operation diagram
Switching to polar coordinates allows for better optimization of vectors in the cache memory.

A new era for local language models

The introduction of TurboQuant could matter enormously for the semiconductor and software industries. Currently, the RAM market is under enormous pressure from data centers building AI clusters, which translates into high prices for end users. If Google's technology is widely implemented, the demand for giant amounts of memory to run models with hundreds of billions of parameters could drop. This, in turn, opens the door to the true democratization of AI: the technology would cease to be the exclusive domain of the most powerful server rooms.

"TurboQuant aims to reduce the size of the key-value cache, which directly eliminates performance bottlenecks in currently popular LLM architectures."

It is worth noting the operational context: TurboQuant is not just a theoretical model but a practical tool aimed at optimizing the key-value cache. As model context windows (the amount of text the AI can "remember" in a single conversation) grow, the size of this cache becomes a critical problem. Thanks to a 6-fold reduction in space, developers will be able to offer much longer conversations and analysis of more extensive documents without the need to invest in new hardware infrastructure.
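Taking the article's sixfold figure at face value, the effect on context length is easy to see. The model dimensions and memory budget below are illustrative assumptions, not benchmark data.

```python
# How many tokens of context fit in a fixed cache budget, with and without
# the claimed 6x compression. Model dimensions are illustrative assumptions.

bytes_per_token = 2 * 32 * 32 * 128 * 2  # keys+values, 32 layers,
                                         # 32 heads, head_dim 128, fp16
budget = 8 * 2**30                       # an 8 GiB slice reserved for cache

plain = budget // bytes_per_token
compressed = budget // (bytes_per_token // 6)
print(plain, compressed)  # → 16384 98304
```

The same memory budget that holds a 16k-token conversation uncompressed would, under these assumptions, hold roughly a 98k-token one after compression.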

The implementation of algorithms such as TurboQuant heralds a paradigm shift in the development of artificial intelligence. Over the last two years, we have observed a race for "raw power"—more parameters, more data, more memory. Now the industry is entering a phase of optimization and engineering elegance. The possibility of running a GPT-4 class model on hardware that previously barely handled simple chatbots is becoming a real prospect. Google, by betting on innovations in data geometry and precise cache management, is setting a path that other players, such as OpenAI or Anthropic, will likely follow, striving to create AI that is more accessible and less resource-hungry.
