Welcome Gemma 4: Frontier multimodal intelligence on device
Photo: Hugging Face Blog
Gemma 4 31B achieves a score of 1,452 points in the LMArena ranking, matching the performance of powerful competitor models despite having nearly thirty times fewer parameters. The new family of multimodal models from Google DeepMind, released on April 2, 2026, under the Apache 2.0 license, redefines the concept of on-device performance. Thanks to the Mixture-of-Experts (MoE) architecture in the 26B variant, the system activates only 4 billion parameters while offering full support for image, video, and audio analysis with a context window of up to 256,000 tokens. A key innovation is the implementation of Per-Layer Embeddings (PLE) and a shared KV Cache, which drastically reduce resource requirements while maintaining high precision.

For users and developers, this represents a breakthrough in the design of local AI agents: the smaller variants (2.3B and 4.5B) effortlessly handle audio processing and variable-aspect-ratio images directly on laptops or smartphones. Full integration with the Hugging Face ecosystem, llama.cpp, MLX, and WebGPU libraries means that advanced multimodal intelligence no longer requires costly cloud infrastructure. Gemma 4 thus becomes the foundation for a new generation of responsive, private applications that understand the visual and auditory world in real time, operating entirely within the user's local environment.
Four sizes, infinite possibilities
Google has decided to diversify its offering by introducing four variants of the model, each addressing different market needs. The key differentiator is the division into dense models and those based on the **Mixture-of-Experts (MoE)** architecture. All versions are available in both a base variant and an instruction-tuned (IT) variant.
- Gemma 4 E2B: A model with an effective 2.3B parameters (5.1B with embeddings), offering a 128k context window. It supports text, image, and audio.
- Gemma 4 E4B: A 4.5B parameter version (8B with embeddings), also with a 128k window and full multimodal support (including audio).
- Gemma 4 31B: A powerful dense model with a 256k context window, designed for the most demanding analytical tasks.
- Gemma 4 26B A4B: An MoE architecture, where only 4B out of a total of 26B parameters are active. It offers a 256k window and performance comparable to the largest dense units.
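
The parameter counts above translate directly into rough hardware requirements. The helper below is a back-of-the-envelope sketch, assuming the headline parameter counts listed above and ignoring activation memory and KV-cache overhead; the quantization levels are illustrative, not official deployment recommendations.

```python
# Rough weight-memory estimate for the Gemma 4 variants listed above.
# Parameter counts come from the article; everything else is illustrative.

VARIANTS = {
    "E2B": 5.1e9,      # 2.3B effective, 5.1B with embeddings
    "E4B": 8.0e9,      # 4.5B effective, 8B with embeddings
    "31B": 31e9,       # dense
    "26B A4B": 26e9,   # MoE: all 26B must be stored, only 4B active per token
}

def weight_gib(params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30

for name, params in VARIANTS.items():
    print(f"{name:8s}  fp16 ≈ {weight_gib(params, 16):5.1f} GiB"
          f"   4-bit ≈ {weight_gib(params, 4):5.1f} GiB")
```

Note the asymmetry in the MoE variant: compute per token scales with the 4B active parameters, but all 26B must still fit in memory (or be streamed), which is what makes the 26B A4B interesting for workstations rather than phones.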

PLE Architecture and Shared KV Cache: Performance Engineering
The success of **Gemma 4** is based on several breakthrough architectural solutions. The most interesting of these is **Per-Layer Embeddings (PLE)**. In traditional transformers, a token receives one embedding vector at the input. PLE introduces an additional, parallel conditioning path of lower dimension that provides a dedicated signal to each decoder layer. This allows the model to specialize layers without having to cram all information into the initial vector. In the case of multimodal data, PLE is calculated before combining visual or audio features with the text sequence. Another pillar of efficiency is the **Shared KV Cache**: the final layers of the model do not calculate their own Key and Value projections but instead reuse states from earlier layers of the same attention type. This drastically reduces memory and computational requirements when generating long text sequences, which is critical for on-device applications.

Multimodality in Practice: From OCR to Video Analysis
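
The Shared KV Cache mechanism described in the previous section can be sketched in a few lines. This is an illustration of the general idea, not Gemma 4's actual implementation: here, only the first layers compute and cache their own K/V projections, while the remaining layers attend over the most recently cached pair, so the cache is stored once per shared group.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq = 16, 8           # head dimension and sequence length (illustrative)
n_layers = 6
n_kv_layers = 4          # only the first 4 layers own K/V weights (assumption)

# Per-layer projections; layers >= n_kv_layers have no K/V weights at all.
Wq = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
Wk = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_kv_layers)]
Wv = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_kv_layers)]

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

h = rng.standard_normal((seq, d))
kv_cache = []            # one (K, V) pair per KV-computing layer

for layer in range(n_layers):
    q = h @ Wq[layer]
    if layer < n_kv_layers:
        k, v = h @ Wk[layer], h @ Wv[layer]
        kv_cache.append((k, v))    # this layer stores its own K/V
    else:
        k, v = kv_cache[-1]        # shared: reuse the last cached K/V
    h = h + attention(q, k, v)     # residual connection

# Only n_kv_layers K/V pairs are cached instead of n_layers.
print(len(kv_cache))  # 4
```

During long-sequence generation the cache, not the weights, dominates memory, so dropping it for a third of the layers (in this toy configuration) is a direct saving at every decoding step.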
Although the full training-data specification has not been disclosed, tests show that **Gemma 4** performs excellently in tasks such as optical character recognition (OCR), object detection, and speech-to-text. The models natively support JSON output, making them ideal for "pointing" and "bounding box" tasks without the need for complex prompting instructions.

Gemma 4 31B achieves an LMArena score of 1,452 points, placing it on par with models such as GLM-5 or Kimi K2.5, despite having approximately 30 times fewer parameters. In graphical user interface (GUI) element detection tests, the model can precisely indicate the coordinates of buttons or text fields, returning the data in structured form. Furthermore, the smaller variants (E2B and E4B) demonstrate the ability to understand video along with its accompanying soundtrack. Although the models were not explicitly trained on video sequences, they can correctly interpret on-screen action and musical context, making them remarkably flexible tools for application developers.
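
Structured bounding-box output of the kind described above can be consumed with very little glue code. A hypothetical sketch, assuming the model returns normalized `[y0, x0, y1, x1]` boxes on a 0–1000 scale (a convention used by earlier Gemma-family models; the exact Gemma 4 schema is an assumption, not confirmed by the article):

```python
import json

# Hypothetical model response for a GUI screenshot; the schema
# (box_2d in normalized 0-1000 coordinates, plus a label) is assumed.
raw = '[{"box_2d": [250, 100, 350, 400], "label": "Submit button"}]'

def to_pixels(box, width, height):
    """Convert a normalized [y0, x0, y1, x1] box to pixel coordinates."""
    y0, x0, y1, x1 = box
    return (round(x0 / 1000 * width), round(y0 / 1000 * height),
            round(x1 / 1000 * width), round(y1 / 1000 * height))

detections = json.loads(raw)
for det in detections:
    left, top, right, bottom = to_pixels(det["box_2d"], width=1920, height=1080)
    print(f'{det["label"]}: ({left}, {top}) -> ({right}, {bottom})')
```

Because the model emits valid JSON directly, no regex post-processing or constrained decoding is needed before handing coordinates to a UI-automation or annotation layer.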

