
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Pixelift Editorial Team

Photo: Hugging Face Blog

As many as 1.7 million diverse charts and diagrams were used to train Granite 4.0 3B Vision – a new, compact model from IBM that redefines how artificial intelligence analyzes corporate documentation. Unveiled on March 31, 2026, this Vision-Language Model (VLM) solution focuses on precise data extraction from tables, charts, and forms, achieving an impressive score of 86.4% in Chart2Summary tests. At the heart of the system is the innovative DeepStack Injection architecture, which separates the processing of semantic features from high-resolution spatial details, allowing the model to understand not only the content but also the complex visual layout of a document. For business users, modularity is key: Granite 4.0 3B Vision functions as a LoRA adapter layered onto the Granite 4.0 Micro base text model. In practice, this means a single deployed instance can seamlessly switch between image analysis and purely text-based tasks, drastically reducing infrastructure costs while maintaining high performance. Through integration with the Docling tool, companies gain a powerful instrument for automating back-office processes, capable of transforming unstructured scans into ready-to-use databases. This represents a clear step toward the democratization of advanced AI in hardware-constrained environments.

DeepStack Architecture and the Power of Precision Data Injection

The key to the success of IBM's new model is a departure from the traditional way of combining images with text. Most modern VLMs inject visual information into the neural network at a single point, forcing the model to simultaneously handle general context (e.g., "this is an invoice") and microscopic spatial details (e.g., "this dot is the decimal point in the amount 1,000.00"). Granite 4.0 3B Vision solves this problem with the **DeepStack Injection** architecture: abstract visual features are routed to the earlier layers of the model, building a foundation for semantic understanding, while detailed high-resolution features go to the later layers, providing the precision needed to identify document layout. As a result, the model "knows" not only what is in the document but, above all, exactly where each element is located. This is critical for Key-Value Pair (KVP) extraction, where the spatial relationship between a label and an entry field determines data correctness.
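The split routing described above can be sketched in a few lines. Everything here (layer count, feature values, the stand-in layer function) is illustrative, not taken from the model's internals:

```python
# Toy, stdlib-only sketch of DeepStack-style injection: coarse "semantic"
# visual features are added to hidden states in EARLY layers, while
# high-resolution "spatial" features are added in LATE layers.

def layer(hidden, idx):
    """Stand-in for a transformer layer: a simple elementwise transform."""
    return [h + 0.1 * idx for h in hidden]

def deepstack_forward(text_hidden, semantic_feats, spatial_feats, n_layers=8):
    early = range(0, n_layers // 2)  # semantic injection points
    h = list(text_hidden)
    injections = []                  # (layer index, feature kind) log
    for i in range(n_layers):
        kind = "semantic" if i in early else "spatial"
        feats = semantic_feats if kind == "semantic" else spatial_feats
        h = [a + b for a, b in zip(h, feats)]  # inject before the layer
        injections.append((i, kind))
        h = layer(h, i)
    return h, injections

hidden, log = deepstack_forward([0.0, 0.0], [0.5, 0.5], [0.01, 0.02])
print(log)
```

The point of the split is that early layers only ever see coarse features, so the semantic pathway is never forced to carry pixel-level detail.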
Figure: Comparison of Granite 4.0 3B Vision with larger models on tasks converting charts to structured formats.

ChartNet: How to Teach AI to Read Charts

Understanding charts has been a barrier for smaller AI models for years. It requires a combination of visual perception, numerical reasoning, and natural language interpretation. To break this impasse, the IBM research team created **ChartNet** – a powerful synthetic dataset comprising 1.7 million samples. This data consists of more than just simple images; each sample in ChartNet contains code generating the chart, the rendered image, the source table, a text summary, and question-and-answer (QA) pairs. By using a code-driven synthesis pipeline, the model learns deep relationships between data and its visual representation. Granite 4.0 3B Vision not only describes that it sees a "bar chart," but can convert it back into a machine format such as JSON or CSV with accuracy surpassing much larger models. In **Chart2Summary** tests, the model achieved an impressive score of 86.4%, outclassing the competition. In the **Chart2CSV** task, with a score of 62.1%, it was second only to the Qwen3.5-9B model, which is three times its size.
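A code-driven sample of this kind can be sketched as follows. The field names are guesses at the schema, not IBM's published format, and the image-rendering step of the real pipeline (executing the chart code to produce a PNG) is omitted:

```python
import json
import random

def make_chartnet_sample(seed=0):
    """Build one ChartNet-style sample: chart-generating code, the source
    table, a text summary, and QA pairs. Field names are illustrative
    guesses, not IBM's actual schema; rendering the image is skipped."""
    rng = random.Random(seed)
    cats = ["Q1", "Q2", "Q3", "Q4"]
    vals = [rng.randint(10, 100) for _ in cats]
    top = cats[vals.index(max(vals))]
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({cats!r}, {vals!r})\n"
        "plt.savefig('chart.png')\n"
    )
    return {
        "code": code,                      # program that renders the chart
        "table": {"category": cats, "value": vals},
        "summary": f"A bar chart of quarterly values; {top} is highest "
                   f"at {max(vals)}.",
        "qa": [{"q": "Which quarter has the highest value?", "a": top}],
    }

sample = make_chartnet_sample(seed=42)
print(json.dumps(sample["table"]))
```

Because every view of the sample (code, table, summary, QA) is derived from the same underlying data, the model can be trained to translate between any pair of them, which is exactly what Chart2CSV and Chart2Summary measure.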

Table and Structural Data Extraction at the Highest Level

Table processing is the "holy grail" of document automation. Granite 4.0 3B Vision has undergone rigorous testing on benchmarks such as **TableVQA-extract**, **OmniDocBench-tables**, and **PubTables-v2**. The results are clear: the model dominates in tasks involving extracting HTML structures from documents. Particularly noteworthy is the score of 92.1 on the TEDS scale for cropped table fragments and 79.3 for full-page documents in the PubTables-v2 benchmark. What sets Granite apart from other solutions is its ability to handle so-called "dirty data" and complex layouts (multi-row, multi-column). The model does not get lost when a table is embedded in dense text or when it has an irregular border. In the **VAREX** test, which simulates real, highly complex US government forms, the model achieved 85.5% accuracy in zero-shot mode (without prior fine-tuning on specific examples).
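Since the model emits table structure as HTML, downstream code still has to turn that HTML into rows. A minimal stdlib sketch for simple tables without row or column spans (spanning cells, which the benchmarks above do include, would need extra handling):

```python
from html.parser import HTMLParser

class SimpleTableParser(HTMLParser):
    """Collect <tr>/<td>/<th> contents into a list of rows.
    Handles only flat tables: no rowspan/colspan, no nesting."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

html = ("<table><tr><th>Benchmark</th><th>TEDS</th></tr>"
        "<tr><td>PubTables-v2 (cropped)</td><td>92.1</td></tr></table>")
p = SimpleTableParser()
p.feed(html)
print(p.rows)
```

Once rows are recovered, writing them out as CSV or JSON for a BI pipeline is trivial.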
Figure: Table extraction results across benchmarks; Granite 4.0 3B Vision sets a new standard for small models.

Modularity and Synergy with the Docling Ecosystem

IBM focused on implementation practicality. Granite 4.0 3B Vision is delivered as a **LoRA** adapter overlaid on the base language model, **Granite 4.0 Micro**. This design lets companies serve multiple tasks from a single infrastructure: when the system processes a text document, it uses the Micro base alone; when it encounters an image or a table, it activates the Vision layer. This drastically reduces VRAM consumption and simplifies data-pipeline architecture. The model becomes even more powerful when integrated with the **Docling** tool. Working in tandem, the process looks as follows:
  • Docling is responsible for initial page layout parsing, OCR, and segmentation of visual elements.
  • Detected tables and charts are "cropped" and sent to Granite 4.0 3B Vision.
  • The Vision model performs precise data extraction into JSON, CSV, or HTML format.
  • The final result is a fully searchable, structured document, ready for analysis by BI systems or RAG databases.
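The four steps above can be sketched as a pipeline skeleton in which both stages are stubs. Real code would call Docling's converter for layout/OCR and the Granite 4.0 3B Vision model for extraction; the function names and return shapes here are assumptions, not either project's actual API:

```python
import json

# Skeleton of the Docling -> Granite Vision pipeline described above.
# Both stage functions are stand-ins with assumed names/shapes.

def docling_segment(page_image):
    """Stub for Docling layout parsing + OCR: returns cropped regions."""
    return [{"kind": "table", "crop": page_image}]

def granite_vision_extract(region_crop):
    """Stub for the VLM extraction step: returns structured data."""
    return {"format": "json",
            "rows": [["Item", "Amount"], ["Total", "1000.00"]]}

def document_to_records(page_image):
    """Run segmentation, route visual regions to the VLM, collect output."""
    records = []
    for region in docling_segment(page_image):
        if region["kind"] in ("table", "chart"):
            records.append(granite_vision_extract(region["crop"]))
    return records

print(json.dumps(document_to_records("invoice_page_1.png")))
```

The routing step is where the modularity pays off: pages with no visual regions never touch the Vision adapter at all.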

A New Performance Standard at Micro Scale

The release of Granite 4.0 3B Vision under the **Apache 2.0** license on the Hugging Face platform is a breakthrough moment for open-source AI. IBM is showing that optimizing architecture and training-data quality (as with ChartNet) matters more than mindlessly scaling parameter counts. For enterprises, this means advanced document analysis can run locally, on relatively cheap hardware, with full control over data privacy. In my opinion, the direction IBM has taken, building small, "sharp" tools instead of large, "blunt" models, will become the dominant trend in 2026. Granite 4.0 3B Vision does not try to be a poet or a programmer; it wants to be the world's best digital archivist and data analyst. Judging by the benchmark results, it is well on its way. It is a model that not only understands documents but, above all, understands business realities, where cost, speed, and error-free performance matter.
