DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

DiCoBench is a multi-image, high-resolution benchmark designed to evaluate whether Multimodal Large Language Models (MLLMs) can autonomously perceive implicit visual cues. It addresses limitations of existing benchmarks, which rely on explicit textual cues or low-resolution inputs, with 765 curated samples. The samples are organized into two progressive tracks, Differential Visual Cues and Commonality Visual Cues, covering 8 distinct perception tasks. Every sample is posed as a multiple-choice question, a format intended to remove evaluation-metric bias, and uses imagery approaching 2K resolution. An evaluation of 18 diverse MLLMs revealed a striking gap relative to human accuracy of 98.3%, indicating that even top-performing models struggle to capture micro-scale detail. The benchmark aims to drive future research in autonomous, high-resolution multi-image perception.

Key takeaway

For AI Scientists and Machine Learning Engineers developing MLLMs, DiCoBench reveals a significant gap in fine-grained perception, especially for micro-scale details. Prioritize research into models that can autonomously perceive implicit visual cues across high-resolution, multi-image inputs. The target is to close the gap to the 98.3% human accuracy reported on the benchmark, specifically on the challenges posed by differential and commonality visual cues.

Key insights

MLLMs significantly underperform humans in high-resolution, multi-image fine-grained perception, highlighting a critical research gap.

Principles

Method

DiCoBench formulates multi-image fine-grained perception as a multiple-choice question task using high-resolution (approaching 2K) imagery. It categorizes 765 samples into Differential and Commonality Visual Cues across 8 tasks.
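
Because every sample is a multiple-choice question, scoring reduces to exact match on the chosen option. The sketch below is a minimal illustration of per-track accuracy under that assumption. The record layout (`Sample`, `sample_id`, `track`, `answer`), the track labels, and the `accuracy_by_track` helper are hypothetical names chosen for this example, not the benchmark's actual schema or evaluation code, and the data is toy data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout; the real DiCoBench schema may differ.
    sample_id: str
    track: str   # assumed labels: "differential" or "commonality"
    answer: str  # gold option letter, e.g. "B"


def accuracy_by_track(samples, predictions):
    """Return {track: accuracy} using exact match on the option letter."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s.track] += 1
        # A missing prediction is treated as an empty string and counts as wrong.
        pred = predictions.get(s.sample_id, "").strip().upper()
        if pred == s.answer.upper():
            correct[s.track] += 1
    return {t: correct[t] / total[t] for t in total}


# Toy data for illustration only, not taken from the benchmark.
toy_samples = [
    Sample("s1", "differential", "A"),
    Sample("s2", "differential", "C"),
    Sample("s3", "commonality", "B"),
    Sample("s4", "commonality", "D"),
]
toy_predictions = {"s1": "a", "s2": "B", "s3": " B "}  # s4 has no prediction

scores = accuracy_by_track(toy_samples, toy_predictions)
```

Exact match on an option letter needs no text-similarity metric or judge model, which is the kind of metric bias the multiple-choice format is meant to avoid.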

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.