Hugging Face Blog · Models · 4 min read

Welcome Gemma 4: Frontier multimodal intelligence on device

Redakcja Pixelift

Photo: Hugging Face Blog

Gemma 4 31B achieves a score of 1,452 points in the LMArena ranking, matching the performance of powerful competitor models despite having roughly thirty times fewer parameters. The new family of multimodal models from Google DeepMind, released on April 2, 2026, under the Apache 2 license, redefines what on-device performance can mean. Thanks to the Mixture-of-Experts (MoE) architecture in the 26B variant, the system activates only 4 billion parameters per token while offering full support for image, video, and audio analysis with a context window of up to 256,000 tokens.

A key innovation is the implementation of Per-Layer Embeddings (PLE) and a shared KV cache, which drastically reduce resource requirements while maintaining high precision. For users and developers, this represents a breakthrough in the design of local AI agents: the smaller variants (2.3B and 4.5B) effortlessly handle audio processing and variable-aspect-ratio images directly on laptops or smartphones.

Full integration with the Hugging Face ecosystem, llama.cpp, MLX, and WebGPU ensures that advanced multimodal intelligence no longer requires costly cloud infrastructure. Gemma 4 thus becomes the foundation for a new generation of responsive, private applications that understand the visual and auditory world in real time, operating entirely within the user's local environment.

Four sizes, infinite possibilities

Google has decided to diversify its offering by introducing four variants of the model, each addressing different market needs. The key differentiator is the division into dense models and those based on the **Mixture-of-Experts (MoE)** architecture. All versions are available in both a base variant and an instruction-tuned (IT) variant.
  • Gemma 4 E2B: A model with an effective 2.3B parameters (5.1B with embeddings), offering a 128k context window. It supports text, image, and audio.
  • Gemma 4 E4B: A 4.5B parameter version (8B with embeddings), also with a 128k window and full multimodal support (including audio).
  • Gemma 4 31B: A powerful dense model with a 256k context window, designed for the most demanding analytical tasks.
  • Gemma 4 26B A4B: An MoE architecture, where only 4B out of a total of 26B parameters are active. It offers a 256k window and performance comparable to the largest dense units.
Image: The Gemma 4 26B A4B model utilizes MoE architecture to deliver high performance at a low computational cost.
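The "A4B" idea (4B active out of 26B total) comes from top-k expert routing: a small router picks a few experts per token, and only their weights participate in the forward pass. The sketch below is a toy illustration of that routing pattern with made-up dimensions; it is not Gemma 4's actual router or expert count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- illustrative only, far smaller than Gemma 4 26B A4B.
d_model = 16        # hidden size
n_experts = 8       # total experts in the MoE layer
top_k = 2           # experts activated per token

router_w = rng.normal(size=(d_model, n_experts))
experts_w = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                          # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # chosen expert indices
    # Softmax over only the selected experts' logits.
    sel = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for k in range(top_k):
            e = top[t, k]
            out[t] += gates[t, k] * (x[t] @ experts_w[e])
    return out, top

tokens = rng.normal(size=(4, d_model))
y, chosen = moe_forward(tokens)

# Only top_k of n_experts run per token, so the active-parameter
# fraction in this layer is top_k / n_experts.
print(f"active experts per token: {top_k}/{n_experts}")
```

Scaling this fraction up is how a 26B-parameter model can run with the per-token compute footprint of a 4B dense model.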

PLE Architecture and Shared KV Cache: Performance Engineering

The success of **Gemma 4** is based on several breakthrough architectural solutions. The most interesting of these is **Per-Layer Embeddings (PLE)**. In traditional transformers, a token receives one embedding vector at the input. PLE introduces an additional, parallel conditioning path of lower dimension that provides a dedicated signal to each decoder layer. This allows the model to specialize layers without having to cram all information into the initial vector. In the case of multimodal data, PLE is calculated before combining visual or audio features with the text sequence. Another pillar of efficiency is the **Shared KV Cache**. In this approach, the final layers of the model do not calculate their own Key and Value projections but instead utilize states from earlier layers of the same attention type. This drastically reduces memory and computational power requirements when generating long text sequences, which is critical for on-device applications.
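The shared-KV idea can be made concrete with a toy attention stack in which some layers reuse the Key/Value tensors of an earlier layer instead of computing and caching their own. The layer-to-source mapping, dimensions, and single-head attention below are assumptions for illustration only, not Gemma 4's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8          # head dimension
n_layers = 6

# Hypothetical sharing map: layers 4 and 5 reuse the K/V of layer 3
# rather than computing (and caching) their own projections.
kv_source = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

wq = rng.normal(size=(n_layers, d, d))
wk = rng.normal(size=(n_layers, d, d))
wv = rng.normal(size=(n_layers, d, d))

def forward(x):
    """Run attention per layer, materializing K/V only for source layers.

    Simplification: K/V are projected from the input x rather than each
    layer's own hidden state, to keep the caching logic easy to see.
    """
    cache = {}
    h = x
    for layer in range(n_layers):
        src = kv_source[layer]
        if src not in cache:               # compute K/V once per source
            cache[src] = (x @ wk[src], x @ wv[src])
        k, v = cache[src]
        q = h @ wq[layer]
        s = q @ k.T
        att = np.exp(s - s.max(-1, keepdims=True))
        att /= att.sum(-1, keepdims=True)
        h = att @ v
    return h, cache

x = rng.normal(size=(5, d))   # 5-token sequence
out, cache = forward(x)

# Only 4 of 6 layers hold K/V tensors: a third of the cache is saved here.
print(f"KV entries cached: {len(cache)} of {n_layers} layers")
```

During long-sequence generation the KV cache dominates memory, so every layer that skips its own K/V projections saves both the cache entry and the projection compute at each decoding step.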

Multimodality in Practice: From OCR to Video Analysis

Despite the lack of full training data specifications, tests show that **Gemma 4** performs excellently in tasks such as optical character recognition (OCR), object detection, and speech-to-text. The models natively support JSON format, making them ideal for "pointing" and "bounding box" tasks without the need for complex prompting instructions.
Gemma 4 31B achieves an LMArena score of 1,452 points, placing it on par with models such as GLM-5 or Kimi K2.5, despite having approximately 30 times fewer parameters.
In graphical user interface (GUI) element detection tests, the model can precisely indicate the coordinates of buttons or text fields, returning data in a structured form. Furthermore, smaller variants (E2B and E4B) demonstrate the ability to understand video along with its accompanying soundtrack. Although the models were not explicitly trained for video sequences, they can correctly interpret on-screen action and musical context, making them incredibly flexible tools for application developers.
Image: Smaller model variants, such as E4B, demonstrate impressive precision in object detection and visual analysis tasks.
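On the application side, "structured JSON output for pointing and bounding boxes" means the model's reply can be parsed and validated mechanically. The sketch below shows one way to consume such output; the schema (a `label` plus a `box_2d` in 0–1000 normalized coordinates) is an assumption for illustration, not Gemma 4's documented format.

```python
import json

# Hypothetical raw output from a detection prompt such as:
#   "Detect every UI element and return JSON with normalized boxes."
raw = """
[
  {"label": "Submit button", "box_2d": [412, 118, 468, 339]},
  {"label": "Search field",  "box_2d": [52, 96, 110, 612]}
]
"""

def parse_detections(text, width, height):
    """Validate detection JSON and rescale 0-1000 boxes to pixels."""
    out = []
    for item in json.loads(text):
        y0, x0, y1, x1 = item["box_2d"]
        assert 0 <= y0 <= y1 <= 1000 and 0 <= x0 <= x1 <= 1000
        out.append({
            "label": item["label"],
            # (left, top, right, bottom) in pixel coordinates
            "box_px": (round(x0 / 1000 * width), round(y0 / 1000 * height),
                       round(x1 / 1000 * width), round(y1 / 1000 * height)),
        })
    return out

boxes = parse_detections(raw, width=1280, height=720)
print(boxes[0])
```

Because the output is plain JSON, a failed `json.loads` or a range check becomes a natural retry signal in an agent loop, with no brittle regex extraction.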

A New Standard for Local Artificial Intelligence

The introduction of **Gemma 4** is a clear signal that the boundary between cloud models and those running locally is blurring faster than anticipated. Through the use of **Dual RoPE** (standard for local layers and proportional for global ones) and intelligent attention management, these models handle massive context windows while remaining responsive on consumer hardware. The ability to run a "frontier" class performance model on a laptop or smartphone, while simultaneously supporting image and audio, opens a new era in the development of AI agents. **Gemma 4** is not just another iteration; it is a mature ecosystem that, thanks to broad compatibility with libraries like **Unsloth Studio**, **TRL**, and **Vertex AI**, will become the foundation for a new wave of creative and business AI applications. Google DeepMind has proven that the future of artificial intelligence lies not only in massive computing clusters but in intelligent, optimized architecture available to every developer.
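The "Dual RoPE" scheme described above can be sketched by applying the same rotary embedding two ways: standard angles for local layers, and proportionally scaled positions for global layers so that long contexts stay within the trained position range. The base and scale values below are illustrative assumptions, not Gemma 4's published hyperparameters.

```python
import numpy as np

d = 8  # head dimension (must be even)

def rope_angles(positions, base, scale=1.0):
    """Rotation angles for RoPE; scale > 1 compresses positions
    proportionally, as used here for long-context global layers."""
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    return np.outer(positions / scale, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive (even, odd) feature pairs by the angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

pos = np.arange(4)
x = np.ones((4, d))

# Local layers: standard RoPE. Global layers: the same rotary scheme with
# positions scaled down (assumed base=10_000, scale=8 for illustration).
local_q = apply_rope(x, rope_angles(pos, base=10_000))
global_q = apply_rope(x, rope_angles(pos, base=10_000, scale=8))

# The scaled angles are exactly 1/8 of the standard ones at every position.
print(np.allclose(rope_angles(pos, 10_000, 8) * 8, rope_angles(pos, 10_000)))
```

Slowing the rotation on global layers lets attention distinguish positions across a very long window, while local layers keep fine-grained resolution over nearby tokens.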
