SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Advanced, quick

Summary

SalArt-VQA is a diagnostic benchmark for evaluating how well vision-language models (VLMs) understand salient artifacts in AI-generated images. It targets a limitation of image-level artifact detection: a correct image-level verdict can mask failures to rely on the right visual cues or to describe the defect. The benchmark comprises 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, real reference images, and generated reference images. The questions fall into four types: presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification. Across the 20 VLMs tested, the strongest model achieved 99.37% detection recall on artifact images but answered all four artifact-side questions correctly on only 53.26% of images. The results point to a sensitivity-calibration tradeoff: sensitive models often make unsupported claims, while conservative models miss real artifacts. High detection accuracy therefore does not imply grounded artifact understanding.

Key takeaway

Machine learning engineers deploying VLMs for quality control of AI-generated images should not rely solely on image-level artifact detection accuracy. Adding a fine-grained diagnostic benchmark such as SalArt-VQA to the evaluation pipeline can uncover hidden failures in visual evidence grounding. That makes it easier to identify models that make unsupported claims or miss real artifacts before their decisions are treated as robust and explainable.

Key insights

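High image-level artifact detection accuracy can hide a lack of grounded understanding of visual evidence: the strongest model tested reached 99.37% detection recall yet answered all four artifact-side questions correctly on only 53.26% of images.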

Method

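SalArt-VQA evaluates VLMs on 950 images and 3,681 human-authored multiple-choice questions of four types: presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification. Alongside the artifact images, real and generated reference images form reference splits used to check calibration.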
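
The gap between detection recall and all-four-correct accuracy can be illustrated with a small scoring sketch. This is not the benchmark's official scoring code. The record layout, the split and question-type names, and the rule that a missing question counts as incorrect are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical question-type labels; the benchmark's actual identifiers may differ.
QUESTION_TYPES = (
    "presence",
    "semantic_localization",
    "spatial_grounding",
    "defect_identification",
)


def summarize(records):
    """Score (image_id, split, question_type, is_correct) tuples.

    Assumptions: split is "artifact" for artifact images and any other value
    for reference images; a question missing for an image counts as incorrect.
    """
    artifact = defaultdict(dict)   # image_id -> {question_type: is_correct}
    reference_presence = []        # presence-question correctness on reference images
    for image_id, split, qtype, is_correct in records:
        if split == "artifact":
            artifact[image_id][qtype] = is_correct
        elif qtype == "presence":
            reference_presence.append(is_correct)

    if not artifact:
        raise ValueError("no artifact-image records")

    n = len(artifact)
    # Image-level view: was the artifact detected at all?
    detected = sum(1 for a in artifact.values() if a.get("presence", False))
    # Grounded view: were all four question types answered correctly?
    all_four = sum(
        1 for a in artifact.values()
        if all(a.get(q, False) for q in QUESTION_TYPES)
    )
    summary = {
        "detection_recall": detected / n,
        "all_four_correct": all_four / n,
    }
    if reference_presence:
        # Low values here suggest unsupported artifact claims on clean images.
        summary["reference_presence_accuracy"] = (
            sum(reference_presence) / len(reference_presence)
        )
    return summary


# Toy data: both artifacts are detected, but only img1 is fully grounded.
records = [
    ("img1", "artifact", "presence", True),
    ("img1", "artifact", "semantic_localization", True),
    ("img1", "artifact", "spatial_grounding", True),
    ("img1", "artifact", "defect_identification", True),
    ("img2", "artifact", "presence", True),
    ("img2", "artifact", "semantic_localization", True),
    ("img2", "artifact", "spatial_grounding", False),
    ("img2", "artifact", "defect_identification", True),
    ("ref1", "real_reference", "presence", True),
    ("ref2", "generated_reference", "presence", False),
]
result = summarize(records)
print(result)
```

On the toy data, detection recall is perfect while the all-four-correct rate is half that, which mirrors in miniature the pattern the benchmark reports for the strongest model.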

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.