LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

LOCUS (LOcal visual CUe Search) is a novel training framework designed to enhance fine-grained visual perception in Multimodal Large Language Models (MLLMs). It addresses the "visual context rot" limitation, where MLLMs fail to reliably select and use decisive local evidence even from high-resolution inputs. LOCUS teaches MLLMs to internalize local evidence search through a proxy task during training. This involves providing a local image crop as a visual cue and optimizing the model to recover its spatial support in the full image using an IoU-based reward. Crucially, this visual cue is only used during training, leaving the standard image-question inference interface unchanged. Experiments demonstrate that LOCUS improves localization-sensitive visual understanding, reduces hallucination, and enhances general understanding and reasoning across benchmarks, all while preserving broad MLLM capabilities. Attention analyses further confirm a stronger focus on task-relevant regions.

Key takeaway

For machine learning engineers building MLLMs for fine-grained visual analysis, LOCUS directly targets "visual context rot." Consider training frameworks that teach the model to internalize local evidence search: the reported experiments show better localization-sensitive understanding and less hallucination, with no change to the standard image-question inference pipeline.

Key insights

LOCUS enhances MLLM fine-grained perception by internalizing local visual cue search via a training-time proxy task.

Principles

MLLMs suffer from visual context rot: they fail to reliably select and use decisive local evidence, even from high-resolution inputs.
Training with local visual cues improves perception.
IoU-based rewards guide spatial evidence recovery.

Method

LOCUS trains MLLMs by providing a local image crop as a visual cue. The model is optimized to recover the cue's spatial support in the full image using an IoU-based reward, enhancing fine-grained evidence selection.
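The proxy task described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how crops are sampled or how the reward is shaped, so the uniform random sampling, the fixed `scale` parameter, and the use of raw IoU as the reward are all assumptions, and the names `sample_cue_box`, `iou`, and `iou_reward` are hypothetical.

```python
import random
from typing import Optional, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates, with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def sample_cue_box(width: float, height: float, scale: float = 0.25,
                   rng: Optional[random.Random] = None) -> Box:
    """Pick a random crop region inside the image (assumed sampling scheme).

    The pixels inside this box would serve as the visual cue; the box itself
    is the ground-truth spatial support the model is asked to recover.
    """
    rng = rng or random.Random()
    w, h = width * scale, height * scale
    x1 = rng.uniform(0, width - w)
    y1 = rng.uniform(0, height - h)
    return (x1, y1, x1 + w, y1 + h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_reward(predicted: Box, target: Box) -> float:
    """Reward for a predicted box (assumption: raw IoU, no thresholding)."""
    return iou(predicted, target)


if __name__ == "__main__":
    target = sample_cue_box(1024, 768, rng=random.Random(0))
    # A perfect prediction earns the maximum reward; a disjoint one earns zero.
    print(iou_reward(target, target))
    print(iou_reward((0, 0, 1, 1), (2, 2, 3, 3)))
```

Because the cue crop is only an input to this training-time reward, nothing in the sketch needs to exist at inference, which matches the summary's point that the image-question interface stays unchanged.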

In practice

Improve MLLM performance on fine-grained perception tasks.
Reduce visual hallucination in MLLM outputs.
Strengthen MLLM reasoning through better localization.

Topics

Multimodal Large Language Models
Fine-Grained Perception
Visual Context Rot
Local Visual Cue Search
IoU Reward
MLLM Hallucination

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.