Hugging Face Blog · Models · 5 min read

Falcon Perception

By Redakcja Pixelift

Photo: Hugging Face Blog

Just 600 million parameters were enough for Falcon Perception to reach 68.0 Macro-F1 on the SA-Co benchmark, outclassing the significantly larger SAM 3 system, which scored 62.3. TII researchers present a breakthrough early-fusion Transformer architecture that abandons complex pipelines in favor of a single, coherent backbone that processes image patches and text simultaneously. The key to this result is a hybrid attention mechanism and the proprietary Chain-of-Perception method: the model analyzes objects in a logical sequence, from predicting coordinates and size to generating a precise segmentation mask. As a result, the system handles open vocabulary and can identify instances from natural-language commands.

Alongside the main model, Falcon OCR debuted: a lightweight 0.3B-parameter model that offers the highest throughput among open-source solutions and scores 88.6 on the OmniDocBench benchmark.

For users and technology developers, this marks a new era of efficient visual analysis. Running advanced segmentation and text reading with minimal computational resources opens the way to real-time image processing, even in crowded, complex scenes. Such efficiency, combined with the project's open-source release, provides a viable alternative to heavy, closed commercial models.

Traditional perception systems typically rely on rigid pipelines: a frozen vision backbone generates features, which are then fused with text in a separate decoder. Falcon Perception breaks this pattern in favor of early-fusion. It is a single, autoregressive Transformer that processes image patches and text tokens in the same parameter space starting from the very first layer. Thanks to this approach, the model does not just "see" objects, but interprets them through the prism of the provided prompt from the beginning.

Hybrid Architecture and the Chain-of-Perception Mechanism

At the heart of the model is a unique hybrid attention mask. It solves a fundamental problem: images have a two-dimensional structure and require bidirectional context, while text and task prediction are inherently sequential. In Falcon Perception, image tokens communicate with each other bidirectionally, building a global visual context, while text and task tokens are subject to causal masking. This allows the model to behave like a vision encoder for the image and like a language model for instructions.
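The described masking scheme can be sketched as a boolean attention matrix. This is an illustrative NumPy reconstruction based only on the description above; the token layout (image tokens first, then text/task tokens) and the helper name are assumptions, not Falcon's actual implementation.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Illustrative hybrid mask: True means attention is allowed.

    Image tokens (the first n_image positions) attend to each other
    bidirectionally, building global visual context; text and task
    tokens are causal, but can always look back at the full image block.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_image, :n_image] = True
    # Text/task block: each token sees everything up to and including itself.
    for i in range(n_image, n):
        mask[i, : i + 1] = True
    return mask

m = hybrid_attention_mask(4, 3)
```

With this mask, the same Transformer behaves like a vision encoder over the image block and like a causal language model over the instruction tokens.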

Instead of costly pixel-by-pixel mask generation, the Chain-of-Perception interface was implemented. The instance detection and segmentation process has been broken down into three logical steps:

  • <coord>: Predicting the center of the instance, allowing the model to "anchor" itself to a specific object.
  • <size>: Determining the spatial extent of the object.
  • <seg>: Generating a single embedding that, after a dot-product operation with upscaled image features, creates a full-resolution mask.
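The final <seg> step reduces mask generation to a single similarity map. Here is a minimal NumPy sketch of that dot-product operation, assuming per-pixel features and a thresholded logit map; the function name and threshold are illustrative, not the published decoder.

```python
import numpy as np

def mask_from_seg_embedding(seg_emb: np.ndarray,
                            image_feats: np.ndarray,
                            threshold: float = 0.0) -> np.ndarray:
    """Sketch of the <seg> step: one embedding, dot-multiplied with
    upscaled per-pixel image features, yields a full-resolution mask.

    seg_emb: (d,) embedding emitted for the <seg> token.
    image_feats: (H, W, d) upscaled image feature map.
    """
    logits = image_feats @ seg_emb   # (H, W) similarity map
    return logits > threshold        # binary instance mask

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))
emb = rng.normal(size=16)
mask = mask_from_seg_embedding(emb, feats)
```

The appeal of this design is cost: instead of autoregressively emitting thousands of mask tokens, the model pays for one embedding per instance and a single matrix product.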
[Image] The Falcon Perception model generates precise instance masks based on text prompts.

Localization precision is supported by specialized heads. They utilize Fourier feature encoding, mapping continuous coordinates onto a high-dimensional sinusoidal space. This avoids the so-called spectral bias of neural networks and achieves higher accuracy than traditional discrete coordinate binning.
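Fourier feature encoding of coordinates is a standard technique, and a minimal version looks like the sketch below. The geometric frequency ladder and the number of frequencies are assumptions; the paper's heads may use different parameters.

```python
import numpy as np

def fourier_encode(coords: np.ndarray, n_freqs: int = 8) -> np.ndarray:
    """Map continuous coordinates in [0, 1] onto a high-dimensional
    sinusoidal space, the usual remedy for spectral bias.

    coords: (..., 2) normalized (x, y) positions.
    Returns: (..., 2 * 2 * n_freqs) features (sin and cos per frequency).
    """
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi   # geometric frequency ladder
    angles = coords[..., None] * freqs            # (..., 2, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*coords.shape[:-1], -1)

f = fourier_encode(np.array([[0.25, 0.75]]))
```

Unlike discrete coordinate binning, this keeps the input continuous while still giving the network high-frequency components to latch onto, which is where the accuracy gain comes from.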

PBench: Diagnostics instead of simple rankings

The model's creators rightly noted that existing benchmarks, such as RefCOCO, become useless when results exceed 90%. They do not allow for an understanding of why a model fails. The answer is PBench – a new test suite that isolates specific model capabilities across five difficulty levels (L0-L4) and in high-density object scenarios (Dense).

The PBench results reveal a crushing advantage for early-fusion in complex tasks. While at the basic object level (L0) the difference between Falcon Perception and SAM 3 is minimal, in tasks requiring an understanding of spatial relationships (L3) or connections between objects (L4), Falcon's advantage is 21.9 and 15.8 percentage points, respectively. The model also demonstrates remarkable resilience to clutter – in the Dense test (hundreds of instances per image), it achieved a score of 72.6, outclassing general VLM models such as Qwen3-VL-30B.

[Video] Falcon Perception handles the segmentation of objects with specific features, such as a burger with a black bun, maintaining consistency across frames.

Training based on distillation and massive data scale

Achieving such high performance with 600 million parameters would not have been possible without an advanced training process. The model did not start from scratch – distillation from two "teachers" was used: DINOv3 (ViT-H) for local details and SigLIP2 for linguistic alignment. This process provided a solid visual foundation (74.25% zero-shot on ImageNet-1k) before the actual perceptual training stage.
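A dual-teacher objective of this kind typically combines a feature-matching term with an alignment term. The sketch below is a hypothetical reconstruction: the L2 loss against DINOv3-style features, the cosine loss against a SigLIP2-style embedding, and the weights are all assumptions, not the published recipe.

```python
import numpy as np

def dual_teacher_loss(student_feats: np.ndarray, dino_feats: np.ndarray,
                      student_pooled: np.ndarray, siglip_emb: np.ndarray,
                      w_local: float = 1.0, w_align: float = 1.0) -> float:
    """Hypothetical combined distillation objective.

    local: L2 feature matching against a DINOv3-style teacher
           preserves fine-grained local detail.
    align: 1 - cosine similarity against a SigLIP2-style embedding
           pulls the pooled representation toward language alignment.
    """
    local = float(np.mean((student_feats - dino_feats) ** 2))
    cos = float(np.dot(student_pooled, siglip_emb) /
                (np.linalg.norm(student_pooled) * np.linalg.norm(siglip_emb)))
    align = 1.0 - cos
    return w_local * local + w_align * align
```

When the student exactly matches both teachers, both terms vanish; in practice the two pull in different directions, and the weighting trades local detail against semantic alignment.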

The training dataset is impressive in its scale and quality:

  • 54 million images and 195 million positive expressions.
  • 488 million hard negatives, which is crucial for eliminating hallucinations.
  • The use of the Muon optimizer for specialized heads, which increased detection scores by 4.8 points on SA-Co.
  • A rigorous verification process: consensus between models (SAM 3, Qwen3-VL-30B, Moondream3) and human verification in disputed cases.

The training itself was divided into three stages: from scene inventory (Stage 1), through task alignment with query masking (Stage 2), to fine-tuning for long context (Stage 3), which allowed the model to handle up to 600 queries per expression.

Performance in numbers and a new quality of OCR

In the SA-Co benchmark, Falcon Perception achieved a score of 68.0 Macro-F1, significantly ahead of SAM 3 (62.3). Particularly impressive are the gains in semantically difficult categories, such as attributes (+8.2) or food and drink (+12.2). The only area where SAM 3 maintains an advantage is presence calibration (MCC 0.82 vs 0.64), which suggests that Falcon still tends to produce masks even when the queried object is absent from the image.
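For context, the presence-calibration metric cited here is the standard Matthews correlation coefficient, computed from the confusion matrix of "object present" predictions (1.0 is perfect, 0 is chance):

```python
import numpy as np

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient over binary presence decisions."""
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den else 0.0
```

A model that draws masks for absent objects inflates the false-positive count, which is exactly what drags MCC down even while Macro-F1 on present objects stays high.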

Parallel to the main model, the team presented Falcon OCR – a compact model with 0.3B parameters. It achieves scores of 80.3 on olmOCR and 88.6 on OmniDocBench, while offering the highest throughput among all available open-source OCR models. This is an excellent addition to the ecosystem, allowing for the instant digitization of documents with minimal computational power requirements.

Falcon Perception proves that the future of precision computer vision lies in integrated architectures that treat image and text as an inseparable whole. The model's success in L2 (OCR-guided) and L3 (Spatial) tests shows that early-fusion is essential for machines to move beyond simple shape recognition and begin to understand contextual instructions. It can be assumed that the "Chain-of-Perception" approach will become the new standard for lightweight but highly capable edge AI models, where parameter efficiency is just as important as masking precision.
