Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

2026-03-11 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, extended

Summary

Granulon is a DINOv3-based Multimodal Large Language Model (MLLM) designed to address the limits of existing visual encoders in multi-granularity visual understanding. CLIP-based encoders excel at global semantic alignment but struggle with fine-grained details, while DINOv3 offers strong pixel-level perception but lacks coarse-grained semantic abstraction. Granulon bridges the two with a text-conditioned granularity Controller, which dynamically adjusts the level of visual abstraction based on the textual input, and an Adaptive Token Aggregation (AdaTA) module, which performs granularity-guided pooling and relation-aware clustering. Together they enable unified "pixel-to-fine-to-coarse" reasoning in a single forward pass. Experiments show Granulon improves accuracy by approximately 30% and reduces hallucination by about 20% across various benchmarks, outperforming other visual encoders under identical settings.

Key takeaway

For AI engineers developing MLLMs, Granulon offers an approach to improving visual understanding and reducing hallucination. Consider integrating adaptive granularity modulation, in particular a text-conditioned controller and adaptive token aggregation, into DINOv3-based visual encoders. By dynamically aligning visual abstraction with linguistic intent, this strategy can improve reasoning accuracy and output fidelity across diverse tasks, including medical imaging.

Key insights

Granulon adaptively modulates visual granularity in MLLMs, enhancing both fine-grained detail and coarse-level semantic abstraction.

Principles

Dynamic granularity improves MLLM accuracy.
Text-conditioned control enhances visual abstraction.
Unified pixel-to-coarse reasoning reduces hallucination.

Method

Granulon uses a text-conditioned Controller to predict optimal visual abstraction and an Adaptive Token Aggregation (AdaTA) module for granularity-guided pooling, feature clustering, and refinement to generate multi-granularity semantic tokens.
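
To illustrate the aggregation idea described above, the following Python sketch pools a grid of patch tokens with a window chosen from a granularity score, then merges near-duplicate tokens by cosine similarity. It is not the authors' AdaTA implementation: the function names (pick_window, pool_tokens, merge_similar), the max_window and threshold parameters, and the greedy merging rule are all assumptions made for this example, and the refinement step is omitted.

```python
import math


def pick_window(g, max_window=4):
    """Map a granularity score g in [0, 1] to a square pooling window.

    Hypothetical convention: g = 0 keeps every patch token (window 1),
    g = 1 pools with the largest window (coarsest view).
    """
    g = min(max(g, 0.0), 1.0)
    return 1 + round(g * (max_window - 1))


def pool_tokens(grid, window):
    """Average-pool an H x W grid of D-dim tokens with a non-overlapping window."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            cell = [
                grid[r][c]
                for r in range(top, min(top + window, h))
                for c in range(left, min(left + window, w))
            ]
            pooled.append([sum(t[k] for t in cell) / len(cell) for k in range(d)])
    return pooled


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def merge_similar(tokens, threshold=0.9):
    """Greedy stand-in for relation-aware clustering (an assumption, not AdaTA).

    Each token joins the first cluster whose centroid is at least
    `threshold` similar; otherwise it starts a new cluster.
    Returns one centroid per cluster.
    """
    clusters = []  # each entry: [running_sum_vector, member_count]
    for t in tokens:
        for cl in clusters:
            centroid = [s / cl[1] for s in cl[0]]
            if cosine(t, centroid) >= threshold:
                cl[0] = [s + x for s, x in zip(cl[0], t)]
                cl[1] += 1
                break
        else:
            clusters.append([list(t), 1])
    return [[s / n for s in total] for total, n in clusters]
```

On a 4 x 4 grid whose left half holds the token [1, 0] and whose right half holds [0, 1], a window of 2 yields four pooled tokens that merge_similar reduces to two centroids. A window of 4 collapses the whole grid into a single averaged token, which shows how a coarser setting trades detail for abstraction.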

In practice

Integrate text-conditioned granularity control.
Employ adaptive token aggregation for multi-scale features.
Prioritize DINOv3 as a base for pixel-level detail.
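
As a starting point for the first recommendation above, a text-conditioned controller can be prototyped as a small head that maps a text embedding to a granularity score. The Python sketch below is a minimal stand-in under stated assumptions: this summary does not specify Granulon's actual Controller architecture, so the linear-plus-sigmoid head, the function names (granularity_score, token_budget), and the min_tokens and max_tokens defaults are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def granularity_score(text_embedding, weights, bias=0.0):
    """Project a text embedding to a score g in [0, 1].

    Hypothetical convention: g near 0 asks for fine, pixel-level detail,
    g near 1 asks for a coarse, global view. In a real system `weights`
    and `bias` would be learned; here they are supplied by the caller.
    """
    if len(text_embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, text_embedding)) + bias
    return sigmoid(z)


def token_budget(g, min_tokens=16, max_tokens=256):
    """Turn a granularity score into a visual-token budget.

    Coarser requests (higher g) receive fewer tokens. The bounds are
    illustrative defaults, not values from the paper.
    """
    g = min(max(g, 0.0), 1.0)
    return round(max_tokens - g * (max_tokens - min_tokens))
```

The score can then drive whatever aggregation step follows, for example by selecting a pooling window or a target token count. Keeping the controller as a separate scalar-producing function makes it straightforward to swap in a learned module later.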

Topics

Multimodal Large Language Models
Visual Encoders
Adaptive Granularity
DINOv3
Hallucination Reduction

Code references

jinlab-imvr/Granulon

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.