LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A 2026 study introduces Ghost-100, a benchmark for evaluating tone-induced hallucination in Vision-Language Models (VLMs). It comprises 800 synthetically generated images across eight categories within three task families: text-illegibility, time-reading, and object-absence. Each image is paired with five prompts from a 5-Level Prompt Intensity Framework, which isolates linguistic tone as the independent variable. Evaluation follows a dual-track protocol: a rule-based H-Rate measures the proportion of responses that shift from grounded refusal to unsupported positive commitment, and an H-Score, judged by GPT-4o-mini on a 1 to 5 scale, quantifies the confidence and specificity of the fabrication. The study also releases a three-stage automated validation workflow, which confirmed 717 of the 800 images as compliant. Across nine open-weight VLMs, the study finds that H-Rate and H-Score dissociate across models, that task types respond differently to prompt pressure, and that some models show non-monotonic sensitivity, with hallucination peaking at intermediate tone levels.

Key takeaway

If you are an AI engineer or research scientist evaluating VLM reliability, note that linguistic tone significantly affects hallucination behavior, often in non-monotonic ways. Relying solely on aggregate metrics or binary detection can obscure critical model vulnerabilities. Adopt dual-track evaluation, reporting both hallucination rate and hallucination score, to see how your models negotiate instruction compliance and safety alignment under varying prompt pressure.
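
To make the point about aggregate metrics concrete, here is a minimal Python sketch. It is not taken from the study: the function name tone_profile, the dictionary layout, and all of the numbers are hypothetical, and levels are assumed to be numbered 1 to 5 in order of increasing intensity. It shows two made-up H-Rate curves that share the same mean across levels but have very different shapes.

```python
from statistics import mean


def tone_profile(h_rate_by_level):
    """Summarise one model's H-Rate curve across prompt intensity levels.

    h_rate_by_level maps a level number to an H-Rate in [0, 1].
    This layout is an assumption made for illustration only.
    """
    levels = sorted(h_rate_by_level)
    rates = [h_rate_by_level[lv] for lv in levels]
    # If several levels tie for the maximum, the lowest such level is reported.
    peak = levels[rates.index(max(rates))]
    return {
        "mean_h_rate": mean(rates),
        "peak_level": peak,
        # True when the worst level is neither the mildest nor the strongest.
        "peaks_at_intermediate_level": peak not in (levels[0], levels[-1]),
        "monotonic_non_decreasing": all(a <= b for a, b in zip(rates, rates[1:])),
    }


# Made-up numbers, not results from the study.
steady = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.40, 5: 0.50}
humped = {1: 0.10, 2: 0.30, 3: 0.60, 4: 0.30, 5: 0.20}

for name, curve in (("steady", steady), ("humped", humped)):
    print(name, tone_profile(curve))
```

Both curves average to roughly the same H-Rate, yet only the per-level view reveals that the second one peaks at level 3. Testing only the mildest and strongest prompts would miss that peak as well.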

Key insights

Linguistic tone systematically influences VLM hallucination, with varying effects on frequency and intensity across models and tasks.

Principles

Method

Ghost-100 uses 800 synthetic images with negative ground truth, five prompt intensity levels per image, and a dual-track evaluation that pairs a rule-based H-Rate with an H-Score judged by GPT-4o-mini. A three-stage automated validation workflow supports the benchmark.
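
The dual-track aggregation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the study's implementation: the Response schema, the dual_track function, and the sample records are all hypothetical. It assumes the rule-based track has already reduced each response to a boolean flag, that the judge has already assigned a 1 to 5 score, and that H-Score is averaged over every response in a group; the source does not specify these details.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Response:
    """One model response to one (image, prompt level) pair. Hypothetical schema."""
    model: str
    level: int          # prompt intensity level, assumed to run from 1 to 5
    hallucinated: bool  # rule-based track: response commits to unsupported content
    h_score: int        # judge track: 1 to 5 rating from the judge model


def dual_track(responses):
    """Return {(model, level): (h_rate, mean_h_score)} for a list of responses."""
    groups = defaultdict(list)
    for r in responses:
        if not 1 <= r.h_score <= 5:
            raise ValueError(f"H-Score out of range: {r.h_score}")
        groups[(r.model, r.level)].append(r)

    summary = {}
    for key, rs in groups.items():
        # H-Rate: fraction of responses flagged by the rule-based track.
        h_rate = sum(r.hallucinated for r in rs) / len(rs)
        # Assumption: H-Score is averaged over all responses in the group.
        summary[key] = (h_rate, mean(r.h_score for r in rs))
    return summary


# Made-up records, not data from the study.
responses = [
    Response("model-a", 1, False, 1),
    Response("model-a", 1, False, 1),
    Response("model-a", 5, True, 2),
    Response("model-a", 5, True, 2),
    Response("model-b", 5, True, 5),
    Response("model-b", 5, False, 1),
]
summary = dual_track(responses)
for key in sorted(summary):
    print(key, summary[key])
```

In this toy data, model-a hallucinates on every level-5 prompt but with low-scoring fabrications, while model-b hallucinates half as often but with a higher mean score. That is the kind of dissociation between frequency and intensity that a single metric would hide.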

In practice

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.