Welcome Gemma 4: Frontier multimodal intelligence on device
Photo: Hugging Face Blog
Gemma 4 31B achieves a score of 1,452 points in the LMArena ranking, matching the performance of powerful competitor models despite having nearly thirty times fewer parameters. The new family of multimodal models from Google DeepMind, released on April 2, 2026, under the Apache 2.0 license, redefines the concept of on-device performance. Thanks to the Mixture-of-Experts (MoE) architecture in the 26B variant, the system activates only 4 billion parameters while offering full support for image, video, and audio analysis with a context window of up to 256,000 tokens. A key innovation is the implementation of Per-Layer Embeddings (PLE) and a shared KV Cache, which drastically reduce resource requirements while maintaining high precision.

For users and developers, this represents a breakthrough in the design of local AI agents: the smaller variants (2.3B and 4.5B) effortlessly handle audio processing and variable-aspect-ratio images directly on laptops or smartphones. Full integration with the Hugging Face ecosystem, llama.cpp, MLX, and WebGPU libraries means that advanced multimodal intelligence no longer requires costly cloud infrastructure. Gemma 4 thus becomes the foundation for a new generation of responsive, private applications that understand the visual and auditory world in real time, operating entirely within the user's local environment.
Four sizes, infinite possibilities
Google has decided to diversify its offering by introducing four variants of the model, each addressing different market needs. The key differentiator is the division into dense models and those based on the **Mixture-of-Experts (MoE)** architecture. All versions are available in both a base variant and an instruction-tuned (IT) variant.
- Gemma 4 E2B: A model with an effective 2.3B parameters (5.1B with embeddings), offering a 128k context window. It supports text, image, and audio.
- Gemma 4 E4B: A 4.5B parameter version (8B with embeddings), also with a 128k window and full multimodal support (including audio).
- Gemma 4 31B: A powerful dense model with a 256k context window, designed for the most demanding analytical tasks.
- Gemma 4 26B A4B: An MoE architecture, where only 4B out of a total of 26B parameters are active. It offers a 256k window and performance comparable to the largest dense units.
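
The parameter counts above translate directly into rough hardware requirements. The helper below is a back-of-the-envelope sketch, assuming the headline parameter counts listed above and ignoring activation memory and KV-cache overhead; the quantization levels are illustrative, not official deployment recommendations.

```python
# Rough weight-memory estimate for the Gemma 4 variants listed above.
# Parameter counts come from the article; everything else is illustrative.

VARIANTS = {
    "E2B": 5.1e9,      # 2.3B effective, 5.1B with embeddings
    "E4B": 8.0e9,      # 4.5B effective, 8B with embeddings
    "31B": 31e9,       # dense
    "26B A4B": 26e9,   # MoE: all 26B must be stored, only 4B active per token
}

def weight_gib(params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30

for name, params in VARIANTS.items():
    print(f"{name:8s}  fp16 ≈ {weight_gib(params, 16):5.1f} GiB"
          f"   4-bit ≈ {weight_gib(params, 4):5.1f} GiB")
```

Note the asymmetry in the MoE variant: compute per token scales with the 4B active parameters, but all 26B must still fit in memory (or be streamed), which is what makes the 26B A4B interesting for workstations rather than phones.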

PLE Architecture and Shared KV Cache: Performance Engineering
The success of **Gemma 4** is based on several breakthrough architectural solutions. The most interesting of these is **Per-Layer Embeddings (PLE)**. In traditional transformers, a token receives one embedding vector at the input. PLE introduces an additional, parallel conditioning path of lower dimension that provides a dedicated signal to each decoder layer. This allows the model to specialize layers without having to cram all information into the initial vector. In the case of multimodal data, PLE is calculated before combining visual or audio features with the text sequence. Another pillar of efficiency is the **Shared KV Cache**: the final layers of the model do not calculate their own Key and Value projections but instead reuse states from earlier layers of the same attention type. This drastically reduces memory and computational requirements when generating long text sequences, which is critical for on-device applications.

Multimodality in Practice: From OCR to Video Analysis
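
The Shared KV Cache mechanism described in the previous section can be sketched in a few lines. This is an illustration of the general idea, not Gemma 4's actual implementation: here, only the first layers compute and cache their own K/V projections, while the remaining layers attend over the most recently cached pair, so the cache is stored once per shared group.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq = 16, 8           # head dimension and sequence length (illustrative)
n_layers = 6
n_kv_layers = 4          # only the first 4 layers own K/V weights (assumption)

# Per-layer projections; layers >= n_kv_layers have no K/V weights at all.
Wq = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
Wk = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_kv_layers)]
Wv = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_kv_layers)]

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

h = rng.standard_normal((seq, d))
kv_cache = []            # one (K, V) pair per KV-computing layer

for layer in range(n_layers):
    q = h @ Wq[layer]
    if layer < n_kv_layers:
        k, v = h @ Wk[layer], h @ Wv[layer]
        kv_cache.append((k, v))    # this layer stores its own K/V
    else:
        k, v = kv_cache[-1]        # shared: reuse the last cached K/V
    h = h + attention(q, k, v)     # residual connection

# Only n_kv_layers K/V pairs are cached instead of n_layers.
print(len(kv_cache))  # 4
```

During long-sequence generation the cache, not the weights, dominates memory, so dropping it for a third of the layers (in this toy configuration) is a direct saving at every decoding step.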
Although the full training-data specification has not been disclosed, tests show that **Gemma 4** performs excellently in tasks such as optical character recognition (OCR), object detection, and speech-to-text. The models natively support JSON output, making them ideal for "pointing" and "bounding box" tasks without the need for complex prompting instructions.

Gemma 4 31B achieves an LMArena score of 1,452 points, placing it on par with models such as GLM-5 or Kimi K2.5, despite having approximately 30 times fewer parameters. In graphical user interface (GUI) element detection tests, the model can precisely indicate the coordinates of buttons or text fields, returning the data in structured form. Furthermore, the smaller variants (E2B and E4B) demonstrate the ability to understand video along with its accompanying soundtrack. Although the models were not explicitly trained on video sequences, they can correctly interpret on-screen action and musical context, making them remarkably flexible tools for application developers.
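
Structured bounding-box output of the kind described above can be consumed with very little glue code. A hypothetical sketch, assuming the model returns normalized `[y0, x0, y1, x1]` boxes on a 0–1000 scale (a convention used by earlier Gemma-family models; the exact Gemma 4 schema is an assumption, not confirmed by the article):

```python
import json

# Hypothetical model response for a GUI screenshot; the schema
# (box_2d in normalized 0-1000 coordinates, plus a label) is assumed.
raw = '[{"box_2d": [250, 100, 350, 400], "label": "Submit button"}]'

def to_pixels(box, width, height):
    """Convert a normalized [y0, x0, y1, x1] box to pixel coordinates."""
    y0, x0, y1, x1 = box
    return (round(x0 / 1000 * width), round(y0 / 1000 * height),
            round(x1 / 1000 * width), round(y1 / 1000 * height))

detections = json.loads(raw)
for det in detections:
    left, top, right, bottom = to_pixels(det["box_2d"], width=1920, height=1080)
    print(f'{det["label"]}: ({left}, {top}) -> ({right}, {bottom})')
```

Because the model emits valid JSON directly, no regex post-processing or constrained decoding is needed before handing coordinates to a UI-automation or annotation layer.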

