Hugging Face Blog · Models · 5 min read

Falcon Perception

By Redakcja Pixelift

Photo: Hugging Face Blog

Just 600 million parameters were enough for Falcon Perception to reach 68.0 Macro-F1 on the SA-Co benchmark, outclassing the significantly larger SAM 3 system, which scored 62.3. TII researchers present a breakthrough early-fusion Transformer architecture that abandons complex pipelines in favor of a single, coherent backbone that processes image patches and text simultaneously. The key to this result is a hybrid attention mechanism and the proprietary Chain-of-Perception method: the model analyzes objects in a logical sequence, from predicting coordinates and size to generating a precise segmentation mask. As a result, the system handles open vocabulary and can identify instances from natural-language commands.

Alongside the main model, Falcon OCR debuted: a lightweight 0.3B-parameter model that offers the highest throughput among open-source solutions and scores 88.6 on the OmniDocBench benchmark.

For users and technology developers, this marks a new era of efficient visual analysis. Running advanced segmentation and text reading with minimal computational resources opens the way to real-time image processing, even in crowded, complex scenes. Such efficiency, combined with the project's open-source release, provides a viable alternative to heavy, closed commercial models.

Traditional perception systems typically rely on rigid pipelines: a frozen vision backbone generates features, which are then fused with text in a separate decoder. Falcon Perception breaks this pattern in favor of early-fusion. It is a single, autoregressive Transformer that processes image patches and text tokens in the same parameter space starting from the very first layer. Thanks to this approach, the model does not just "see" objects, but interprets them through the prism of the provided prompt from the beginning.

Hybrid Architecture and the Chain-of-Perception Mechanism

At the heart of the model is a unique hybrid attention mask. It solves a fundamental problem: images have a two-dimensional structure and require bidirectional context, while text and task prediction are inherently sequential. In Falcon Perception, image tokens communicate with each other bidirectionally, building a global visual context, while text and task tokens are subject to causal masking. This allows the model to behave like a vision encoder for the image and like a language model for instructions.
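The described masking scheme can be sketched as a boolean attention matrix. This is an illustrative NumPy reconstruction based only on the description above; the token layout (image tokens first, then text/task tokens) and the helper name are assumptions, not Falcon's actual implementation.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Illustrative hybrid mask: True means attention is allowed.

    Image tokens (the first n_image positions) attend to each other
    bidirectionally, building global visual context; text and task
    tokens are causal, but can always look back at the full image block.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_image, :n_image] = True
    # Text/task block: each token sees everything up to and including itself.
    for i in range(n_image, n):
        mask[i, : i + 1] = True
    return mask

m = hybrid_attention_mask(4, 3)
```

With this mask, the same Transformer behaves like a vision encoder over the image block and like a causal language model over the instruction tokens.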

Instead of costly pixel-by-pixel mask generation, the Chain-of-Perception interface was implemented. The instance detection and segmentation process has been broken down into three logical steps:

  • <coord>: Predicting the center of the instance, allowing the model to "anchor" itself to a specific object.
  • <size>: Determining the spatial extent of the object.
  • <seg>: Generating a single embedding that, after a dot-product operation with upscaled image features, creates a full-resolution mask.
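The final <seg> step reduces mask generation to a single similarity map. Here is a minimal NumPy sketch of that dot-product operation, assuming per-pixel features and a thresholded logit map; the function name and threshold are illustrative, not the published decoder.

```python
import numpy as np

def mask_from_seg_embedding(seg_emb: np.ndarray,
                            image_feats: np.ndarray,
                            threshold: float = 0.0) -> np.ndarray:
    """Sketch of the <seg> step: one embedding, dot-multiplied with
    upscaled per-pixel image features, yields a full-resolution mask.

    seg_emb: (d,) embedding emitted for the <seg> token.
    image_feats: (H, W, d) upscaled image feature map.
    """
    logits = image_feats @ seg_emb   # (H, W) similarity map
    return logits > threshold        # binary instance mask

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))
emb = rng.normal(size=16)
mask = mask_from_seg_embedding(emb, feats)
```

The appeal of this design is cost: instead of autoregressively emitting thousands of mask tokens, the model pays for one embedding per instance and a single matrix product.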
[Image] The Falcon Perception model generates precise instance masks based on text prompts.

Localization precision is supported by specialized heads. They utilize Fourier feature encoding, mapping continuous coordinates onto a high-dimensional sinusoidal space. This avoids the so-called spectral bias of neural networks and achieves higher accuracy than traditional discrete coordinate binning.
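Fourier feature encoding of coordinates is a standard technique, and a minimal version looks like the sketch below. The geometric frequency ladder and the number of frequencies are assumptions; the paper's heads may use different parameters.

```python
import numpy as np

def fourier_encode(coords: np.ndarray, n_freqs: int = 8) -> np.ndarray:
    """Map continuous coordinates in [0, 1] onto a high-dimensional
    sinusoidal space, the usual remedy for spectral bias.

    coords: (..., 2) normalized (x, y) positions.
    Returns: (..., 2 * 2 * n_freqs) features (sin and cos per frequency).
    """
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi   # geometric frequency ladder
    angles = coords[..., None] * freqs            # (..., 2, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*coords.shape[:-1], -1)

f = fourier_encode(np.array([[0.25, 0.75]]))
```

Unlike discrete coordinate binning, this keeps the input continuous while still giving the network high-frequency components to latch onto, which is where the accuracy gain comes from.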

PBench: Diagnostics instead of simple rankings

The model's creators rightly noted that existing benchmarks, such as RefCOCO, become useless when results exceed 90%. They do not allow for an understanding of why a model fails. The answer is PBench – a new test suite that isolates specific model capabilities across five difficulty levels (L0-L4) and in high-density object scenarios (Dense).

The PBench results reveal a crushing advantage for early-fusion in complex tasks. While at the basic object level (L0) the difference between Falcon Perception and SAM 3 is minimal, in tasks requiring an understanding of spatial relationships (L3) or connections between objects (L4), Falcon's advantage is 21.9 and 15.8 percentage points, respectively. The model also demonstrates remarkable resilience to clutter – in the Dense test (hundreds of instances per image), it achieved a score of 72.6, outclassing general VLM models such as Qwen3-VL-30B.

[Video] Falcon Perception handles the segmentation of objects with specific features, such as a burger with a black bun, maintaining consistency across frames.

Training based on distillation and massive data scale

Achieving such high performance with 600 million parameters would not have been possible without an advanced training process. The model did not start from scratch – distillation from two "teachers" was used: DINOv3 (ViT-H) for local details and SigLIP2 for linguistic alignment. This process provided a solid visual foundation (74.25% zero-shot on ImageNet-1k) before the actual perceptual training stage.
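A dual-teacher objective of this kind typically combines a feature-matching term with an alignment term. The sketch below is a hypothetical reconstruction: the L2 loss against DINOv3-style features, the cosine loss against a SigLIP2-style embedding, and the weights are all assumptions, not the published recipe.

```python
import numpy as np

def dual_teacher_loss(student_feats: np.ndarray, dino_feats: np.ndarray,
                      student_pooled: np.ndarray, siglip_emb: np.ndarray,
                      w_local: float = 1.0, w_align: float = 1.0) -> float:
    """Hypothetical combined distillation objective.

    local: L2 feature matching against a DINOv3-style teacher
           preserves fine-grained local detail.
    align: 1 - cosine similarity against a SigLIP2-style embedding
           pulls the pooled representation toward language alignment.
    """
    local = float(np.mean((student_feats - dino_feats) ** 2))
    cos = float(np.dot(student_pooled, siglip_emb) /
                (np.linalg.norm(student_pooled) * np.linalg.norm(siglip_emb)))
    align = 1.0 - cos
    return w_local * local + w_align * align
```

When the student exactly matches both teachers, both terms vanish; in practice the two pull in different directions, and the weighting trades local detail against semantic alignment.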

The training dataset is impressive in its scale and quality:

  • 54 million images and 195 million positive expressions.
  • 488 million hard negatives, which is crucial for eliminating hallucinations.
  • The use of the Muon optimizer for specialized heads, which increased detection scores by 4.8 points on SA-Co.
  • A rigorous verification process: consensus between models (SAM 3, Qwen3-VL-30B, Moondream3) and human verification in disputed cases.

The training itself was divided into three stages: from scene inventory (Stage 1), through task alignment with query masking (Stage 2), to fine-tuning for long context (Stage 3), which allowed the model to handle up to 600 queries per expression.

Performance in numbers and a new quality of OCR

In the SA-Co benchmark, Falcon Perception achieved a score of 68.0 Macro-F1, significantly ahead of SAM 3 (62.3). Particularly impressive are the gains in semantically difficult categories, such as attributes (+8.2) or food and drink (+12.2). The only area where SAM 3 maintains an advantage is presence calibration (MCC 0.82 vs 0.64), which suggests that Falcon still tends to produce masks even when the queried object is absent from the image.
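For context, the presence-calibration metric cited here is the standard Matthews correlation coefficient, computed from the confusion matrix of "object present" predictions (1.0 is perfect, 0 is chance):

```python
import numpy as np

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient over binary presence decisions."""
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den else 0.0
```

A model that draws masks for absent objects inflates the false-positive count, which is exactly what drags MCC down even while Macro-F1 on present objects stays high.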

Parallel to the main model, the team presented Falcon OCR – a compact model with 0.3B parameters. It achieves scores of 80.3 on olmOCR and 88.6 on OmniDocBench, while offering the highest throughput among all available open-source OCR models. This is an excellent addition to the ecosystem, allowing for the instant digitization of documents with minimal computational power requirements.

Falcon Perception proves that the future of precision computer vision lies in integrated architectures that treat image and text as an inseparable whole. The model's success in L2 (OCR-guided) and L3 (Spatial) tests shows that early-fusion is essential for machines to move beyond simple shape recognition and begin to understand contextual instructions. It can be assumed that the "Chain-of-Perception" approach will become the new standard for lightweight but highly capable edge AI models, where parameter efficiency is just as important as masking precision.
