
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Pixelift Editorial Team

Photo: Hugging Face Blog

As many as 1.7 million diverse charts and diagrams were used to train Granite 4.0 3B Vision – a new, compact model from IBM that redefines how artificial intelligence analyzes corporate documentation. Unveiled on March 31, 2026, this Vision-Language Model (VLM) solution focuses on precise data extraction from tables, charts, and forms, achieving an impressive score of 86.4% in Chart2Summary tests. At the heart of the system is the innovative DeepStack Injection architecture, which separates the processing of semantic features from high-resolution spatial details, allowing the model to understand not only the content but also the complex visual layout of a document. For business users, modularity is key: Granite 4.0 3B Vision functions as a LoRA adapter layered onto the Granite 4.0 Micro base text model. In practice, this means a single deployed instance can seamlessly switch between image analysis and purely text-based tasks, drastically reducing infrastructure costs while maintaining high performance. Through integration with the Docling tool, companies gain a powerful instrument for automating back-office processes, capable of transforming unstructured scans into ready-to-use databases. This represents a clear step toward the democratization of advanced AI in hardware-constrained environments.

DeepStack Architecture and the Power of Precision Data Injection

The key to the success of IBM's new model is a departure from the traditional way of combining images with text. Most modern VLMs inject visual information into the neural network at a single point, forcing the model to simultaneously handle general context (e.g., "this is an invoice") and microscopic spatial details (e.g., "this dot is the decimal point in the amount 1,000.00"). Granite 4.0 3B Vision solves this problem with the **DeepStack Injection** architecture: abstract visual features are routed to the earlier layers of the model, building a foundation for semantic understanding, while detailed high-resolution features go to the later layers, providing the precision needed to identify document layout. As a result, the model "knows" not only what is in the document but, above all, exactly where each element is located. This is critical for Key-Value Pair (KVP) extraction, where the spatial relationship between a label and an entry field determines data correctness.
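The split routing described above can be sketched in a few lines. Everything here (layer count, feature values, the stand-in layer function) is illustrative, not taken from the model's internals:

```python
# Toy, stdlib-only sketch of DeepStack-style injection: coarse "semantic"
# visual features are added to hidden states in EARLY layers, while
# high-resolution "spatial" features are added in LATE layers.

def layer(hidden, idx):
    """Stand-in for a transformer layer: a simple elementwise transform."""
    return [h + 0.1 * idx for h in hidden]

def deepstack_forward(text_hidden, semantic_feats, spatial_feats, n_layers=8):
    early = range(0, n_layers // 2)  # semantic injection points
    h = list(text_hidden)
    injections = []                  # (layer index, feature kind) log
    for i in range(n_layers):
        kind = "semantic" if i in early else "spatial"
        feats = semantic_feats if kind == "semantic" else spatial_feats
        h = [a + b for a, b in zip(h, feats)]  # inject before the layer
        injections.append((i, kind))
        h = layer(h, i)
    return h, injections

hidden, log = deepstack_forward([0.0, 0.0], [0.5, 0.5], [0.01, 0.02])
print(log)
```

The point of the split is that early layers only ever see coarse features, so the semantic pathway is never forced to carry pixel-level detail.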
Figure: Comparison of Granite 4.0 3B Vision with larger models on tasks converting charts to structured formats.

ChartNet: How to Teach AI to Read Charts

Understanding charts has been a barrier for smaller AI models for years. It requires a combination of visual perception, numerical reasoning, and natural language interpretation. To break this impasse, the IBM research team created **ChartNet** – a powerful synthetic dataset comprising 1.7 million samples. This data consists of more than just simple images; each sample in ChartNet contains code generating the chart, the rendered image, the source table, a text summary, and question-and-answer (QA) pairs. By using a code-driven synthesis pipeline, the model learns deep relationships between data and its visual representation. Granite 4.0 3B Vision not only describes that it sees a "bar chart," but can convert it back into a machine format such as JSON or CSV with accuracy surpassing much larger models. In **Chart2Summary** tests, the model achieved an impressive score of 86.4%, outclassing the competition. In the **Chart2CSV** task, with a score of 62.1%, it was second only to the Qwen3.5-9B model, which is three times its size.
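A code-driven sample of this kind can be sketched as follows. The field names are guesses at the schema, not IBM's published format, and the image-rendering step of the real pipeline (executing the chart code to produce a PNG) is omitted:

```python
import json
import random

def make_chartnet_sample(seed=0):
    """Build one ChartNet-style sample: chart-generating code, the source
    table, a text summary, and QA pairs. Field names are illustrative
    guesses, not IBM's actual schema; rendering the image is skipped."""
    rng = random.Random(seed)
    cats = ["Q1", "Q2", "Q3", "Q4"]
    vals = [rng.randint(10, 100) for _ in cats]
    top = cats[vals.index(max(vals))]
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({cats!r}, {vals!r})\n"
        "plt.savefig('chart.png')\n"
    )
    return {
        "code": code,                      # program that renders the chart
        "table": {"category": cats, "value": vals},
        "summary": f"A bar chart of quarterly values; {top} is highest "
                   f"at {max(vals)}.",
        "qa": [{"q": "Which quarter has the highest value?", "a": top}],
    }

sample = make_chartnet_sample(seed=42)
print(json.dumps(sample["table"]))
```

Because every view of the sample (code, table, summary, QA) is derived from the same underlying data, the model can be trained to translate between any pair of them, which is exactly what Chart2CSV and Chart2Summary measure.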

Table and Structural Data Extraction at the Highest Level

Table processing is the "holy grail" of document automation. Granite 4.0 3B Vision has undergone rigorous testing on benchmarks such as **TableVQA-extract**, **OmniDocBench-tables**, and **PubTables-v2**. The results are clear: the model dominates in tasks involving extracting HTML structures from documents. Particularly noteworthy is the score of 92.1 on the TEDS scale for cropped table fragments and 79.3 for full-page documents in the PubTables-v2 benchmark. What sets Granite apart from other solutions is its ability to handle so-called "dirty data" and complex layouts (multi-row, multi-column). The model does not get lost when a table is embedded in dense text or when it has an irregular border. In the **VAREX** test, which simulates real, highly complex US government forms, the model achieved 85.5% accuracy in zero-shot mode (without prior fine-tuning on specific examples).
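Since the model emits table structure as HTML, downstream code still has to turn that HTML into rows. A minimal stdlib sketch for simple tables without row or column spans (spanning cells, which the benchmarks above do include, would need extra handling):

```python
from html.parser import HTMLParser

class SimpleTableParser(HTMLParser):
    """Collect <tr>/<td>/<th> contents into a list of rows.
    Handles only flat tables: no rowspan/colspan, no nesting."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

html = ("<table><tr><th>Benchmark</th><th>TEDS</th></tr>"
        "<tr><td>PubTables-v2 (cropped)</td><td>92.1</td></tr></table>")
p = SimpleTableParser()
p.feed(html)
print(p.rows)
```

Once rows are recovered, writing them out as CSV or JSON for a BI pipeline is trivial.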
Figure: Table extraction results across benchmarks; Granite 4.0 3B Vision sets a new standard for small models.

Modularity and Synergy with the Docling Ecosystem

IBM focused on implementation practicality. Granite 4.0 3B Vision is delivered as a **LoRA** adapter overlaid on the base language model, **Granite 4.0 Micro**. This design lets companies serve multiple tasks from a single infrastructure: when the system processes a text document, it uses the Micro base alone; when it encounters an image or a table, it activates the Vision layer. This drastically reduces VRAM consumption and simplifies data-pipeline architecture. The model becomes even more powerful when integrated with the **Docling** tool. Working in tandem, the process looks as follows:
  • Docling is responsible for initial page layout parsing, OCR, and segmentation of visual elements.
  • Detected tables and charts are "cropped" and sent to Granite 4.0 3B Vision.
  • The Vision model performs precise data extraction into JSON, CSV, or HTML format.
  • The final result is a fully searchable, structured document, ready for analysis by BI systems or RAG databases.
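The four steps above can be sketched as a pipeline skeleton in which both stages are stubs. Real code would call Docling's converter for layout/OCR and the Granite 4.0 3B Vision model for extraction; the function names and return shapes here are assumptions, not either project's actual API:

```python
import json

# Skeleton of the Docling -> Granite Vision pipeline described above.
# Both stage functions are stand-ins with assumed names/shapes.

def docling_segment(page_image):
    """Stub for Docling layout parsing + OCR: returns cropped regions."""
    return [{"kind": "table", "crop": page_image}]

def granite_vision_extract(region_crop):
    """Stub for the VLM extraction step: returns structured data."""
    return {"format": "json",
            "rows": [["Item", "Amount"], ["Total", "1000.00"]]}

def document_to_records(page_image):
    """Run segmentation, route visual regions to the VLM, collect output."""
    records = []
    for region in docling_segment(page_image):
        if region["kind"] in ("table", "chart"):
            records.append(granite_vision_extract(region["crop"]))
    return records

print(json.dumps(document_to_records("invoice_page_1.png")))
```

The routing step is where the modularity pays off: pages with no visual regions never touch the Vision adapter at all.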

A New Performance Standard at Micro Scale

The release of Granite 4.0 3B Vision under the **Apache 2.0** license on the Hugging Face platform is a breakthrough moment for open-source AI. IBM is showing that optimizing architecture and training-data quality (as with ChartNet) matters more than mindlessly scaling parameter counts. For enterprises, this means advanced document analysis can run locally, on relatively cheap hardware, with full control over data privacy. In my opinion, the direction IBM has taken, building small, "sharp" tools instead of large, "blunt" models, will become the dominant trend in 2026. Granite 4.0 3B Vision does not try to be a poet or a programmer; it wants to be the world's best digital archivist and data analyst. Judging by the benchmark results, it is well on its way. It is a model that not only understands documents but, above all, understands business realities, where cost, speed, and error-free performance matter.
