LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models
Summary
LOCUS (LOcal visual CUe Search) is a training framework designed to enhance fine-grained visual perception in Multimodal Large Language Models (MLLMs). It targets "visual context rot," a limitation in which MLLMs fail to reliably select and use decisive local evidence even when given high-resolution inputs. LOCUS teaches MLLMs to internalize local evidence search through a proxy task during training: the model receives a local image crop as a visual cue and is optimized, with an IoU-based reward, to recover the crop's spatial support in the full image. The visual cue is used only during training, so the standard image-question inference interface is unchanged. Experiments show that LOCUS improves localization-sensitive visual understanding, reduces hallucination, and enhances general understanding and reasoning across benchmarks while preserving broad MLLM capabilities. Attention analyses further indicate a stronger focus on task-relevant regions.
Key takeaway
For machine learning engineers building MLLMs for fine-grained visual analysis, LOCUS offers a way to mitigate "visual context rot." Consider training frameworks that teach models to internalize local evidence search: in the reported experiments, this approach improves localization-sensitive understanding and reduces hallucination without changing the standard inference pipeline, which makes MLLMs more reliable on tasks that require precise visual detail.
Key insights
LOCUS enhances MLLM fine-grained perception by internalizing local visual cue search via a training-time proxy task.
Principles
- MLLMs suffer from visual context rot: they fail to select and use decisive local evidence.
- Training with local visual cues improves perception.
- IoU-based rewards guide spatial evidence recovery.
Method
LOCUS trains MLLMs by providing a local image crop as a visual cue. The model is optimized to recover the cue's spatial support in the full image using an IoU-based reward, enhancing fine-grained evidence selection.
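The proxy task described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary does not say how crops are sampled, how the model emits a box, or how the reward enters the optimizer. The helper names (`sample_cue_box`, `iou`, `iou_reward`), the `(x1, y1, x2, y2)` pixel box format, and the fixed crop `scale` are all assumptions made for this example.

```python
import random
from typing import Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def sample_cue_box(width: int, height: int, scale: float = 0.25,
                   rng: Optional[random.Random] = None) -> Box:
    """Pick a random crop region inside a width x height image.

    The crop would serve as the visual cue; its coordinates are the
    ground-truth target the model should recover. (Hypothetical
    sampling scheme, chosen only for illustration.)
    """
    rng = rng or random.Random()
    w, h = width * scale, height * scale
    x1 = rng.uniform(0, width - w)
    y1 = rng.uniform(0, height - h)
    return (x1, y1, x1 + w, y1 + h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_reward(predicted: Box, target: Box) -> float:
    """Reward in [0, 1]: 1 when the predicted box matches the cue's
    true location exactly, 0 when the boxes do not overlap."""
    return iou(predicted, target)
```

In a training loop of this kind, the target box from `sample_cue_box` would define the crop shown to the model, and `iou_reward` would score the box the model predicts for it. At inference time none of this is needed, since the model receives only the image and the question.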
In practice
- Improve MLLM performance on fine-grained perception tasks.
- Reduce visual hallucination in MLLM outputs.
- Strengthen MLLM reasoning through better localization.
Topics
- Multimodal Large Language Models
- Fine-Grained Perception
- Visual Context Rot
- Local Visual Cue Search
- IoU Reward
- MLLM Hallucination
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.