LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A 2026 study introduces Ghost-100, a benchmark for evaluating tone-induced hallucination in Vision-Language Models (VLMs). It comprises 800 synthetically generated images across eight categories in three task families: text-illegibility, time-reading, and object-absence. Each image is paired with five prompts from a 5-Level Prompt Intensity Framework, so that linguistic tone is the only independent variable. Evaluation follows a dual-track protocol: a rule-based H-Rate measures the proportion of responses that shift from grounded refusal to unsupported positive commitment, and an H-Score (1-5 scale) assigned by a GPT-4o-mini judge quantifies the confidence and specificity of the fabrication. The study also releases a three-stage automated validation workflow, which confirmed 717 of the 800 images (about 90%) as compliant. Across nine open-weight VLMs, the study reports three findings: H-Rate and H-Score dissociate across models, task types respond differently to prompt pressure, and some models show non-monotonic sensitivity, peaking at intermediate tone levels.

Key takeaway

AI engineers and research scientists evaluating VLM reliability should account for linguistic tone, which significantly affects hallucination behavior, often in non-monotonic ways. Aggregate metrics or binary detection alone can obscure critical model vulnerabilities. A dual-track evaluation that reports both hallucination rate and hallucination score gives a fuller picture of how models negotiate instruction compliance and safety alignment under varying prompt pressure.

Key insights

Linguistic tone systematically influences VLM hallucination, but its effect on frequency (H-Rate) and intensity (H-Score) varies across models and tasks.

Principles

Hallucination is a modulated behavior, not a static model property.
Negative-ground-truth benchmarks reduce annotation ambiguity.
Dual-track metrics (rate and score) offer finer-grained evaluation.

Method

Ghost-100 combines 800 synthetic images with negative ground truth, five prompt intensity levels, and a dual-track evaluation (a rule-based H-Rate and an H-Score judged by GPT-4o-mini), supported by a three-stage automated validation workflow.
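
The dual-track aggregation described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the `Judged` record, the `committed` flag (standing in for the output of the rule-based check), and the `dual_track` function are hypothetical names, and the sketch assumes each response has already been labeled and scored. Because the ground truth is negative for every image, H-Rate is taken here as the fraction of responses that commit to unsupported content.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Judged:
    """One evaluated response (hypothetical record layout)."""
    model: str
    level: int       # prompt intensity level, 1..5
    committed: bool  # assumed rule-based label: True = unsupported positive commitment
    score: int       # assumed judge-assigned H-Score, 1..5


def dual_track(records):
    """Return {(model, level): (h_rate, mean_h_score)} for a list of Judged records."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.model, r.level)].append(r)

    results = {}
    for key, rs in groups.items():
        h_rate = sum(r.committed for r in rs) / len(rs)  # frequency track
        h_score = mean(r.score for r in rs)              # intensity track
        results[key] = (h_rate, h_score)
    return results
```

Keeping the two values side by side per model and tone level makes it possible to see cases where they diverge, for example a model that hallucinates rarely but with high confidence when it does.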

In practice

Use Ghost-100 to benchmark VLM robustness to linguistic coercion.
Implement dual-track H-Rate and H-Score for nuanced VLM evaluation.
Apply the automated validation workflow for scalable dataset auditing.
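
When benchmarking across the five tone levels, one useful check is whether a model's H-Rate changes steadily with prompt pressure or peaks at an intermediate level, as the study reports for some models. The sketch below assumes a list of per-level H-Rate values ordered from level 1 to level 5; the `tone_profile` helper is a hypothetical name, not part of Ghost-100.

```python
def tone_profile(h_rates):
    """Summarize how H-Rate moves across ordered prompt intensity levels.

    h_rates: sequence of H-Rate values, index 0 = level 1 (assumed ordering).
    Returns the 1-based level with the highest H-Rate (first one on ties)
    and whether the sequence is monotonic (never decreasing or never increasing).
    """
    peak_index = max(range(len(h_rates)), key=lambda i: h_rates[i])
    steps = [b - a for a, b in zip(h_rates, h_rates[1:])]
    monotonic = all(s >= 0 for s in steps) or all(s <= 0 for s in steps)
    return {"peak_level": peak_index + 1, "monotonic": monotonic}
```

A result with `monotonic` false and a `peak_level` below 5 flags a model that is most vulnerable at an intermediate tone, which an evaluation using only the strongest prompt would miss.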

Topics

LLM-as-Judge Framework
Vision-Language Models
Hallucination Evaluation
Ghost-100 Benchmark
Prompt Intensity

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.