LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Summary
A 2026 study introduces Ghost-100, a benchmark for evaluating tone-induced hallucination in Vision-Language Models (VLMs). It comprises 800 synthetically generated images in eight categories spanning three task families: text illegibility, time reading, and object absence. Each image is paired with five prompts drawn from a 5-Level Prompt Intensity Framework, so that linguistic tone is the only independent variable. Evaluation follows a dual-track protocol: a rule-based H-Rate measures the proportion of responses that shift from grounded refusal to unsupported positive commitment, and an H-Score (1-5 scale) assigned by a GPT-4o-mini judge quantifies the confidence and specificity of the fabrication. The study also releases a three-stage automated validation workflow, which confirmed 717 of the 800 images (about 90%) as compliant. Across nine open-weight VLMs, the study finds that H-Rate and H-Score dissociate across models, that task types respond differently to prompt pressure, and that some models show non-monotonic sensitivity, with hallucination peaking at intermediate tone levels.
Key takeaway
AI Engineers and Research Scientists evaluating VLM reliability should account for the effect of linguistic tone on hallucination behavior, which is significant and often non-monotonic. Relying solely on aggregate metrics or binary detection can obscure critical model vulnerabilities. Dual-track evaluation, which reports both hallucination rate and hallucination score, gives a fuller picture of how a model negotiates instruction compliance and safety alignment under varying prompt pressure.
Key insights
Linguistic tone systematically influences VLM hallucination, with varying effects on frequency and intensity across models and tasks.
Principles
- Hallucination is a modulated behavior, not a static model property.
- Negative-ground-truth benchmarks reduce annotation ambiguity.
- Dual-track metrics (rate and score) offer finer-grained evaluation.
Method
Ghost-100 combines 800 synthetic images with negative ground truth, five prompt intensity levels per image, and a dual-track evaluation: a rule-based H-Rate and an H-Score judged by GPT-4o-mini. A three-stage automated validation workflow supports the dataset.
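As a rough illustration of how the two tracks could be aggregated, the sketch below groups responses by model and prompt level and reports both metrics side by side. The `Response` record, its field names, and the `dual_track` function are hypothetical and not taken from the study. The choice to average the judge score over all responses in a group, rather than only over flagged ones, is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Response:
    """One model answer to one (image, prompt level) pair. Field names are hypothetical."""
    model: str        # model identifier
    level: int        # prompt intensity level, 1..5
    committed: bool   # rule-based flag: answer makes an unsupported positive commitment
    judge_score: int  # severity assigned by the LLM judge, 1..5


def dual_track(responses):
    """Return {(model, level): (h_rate, h_score)} for a list of Response records."""
    groups = defaultdict(list)
    for r in responses:
        groups[(r.model, r.level)].append(r)

    summary = {}
    for key, rows in groups.items():
        # H-Rate: share of responses flagged by the rule-based track.
        h_rate = sum(r.committed for r in rows) / len(rows)
        # H-Score: mean judge score over all responses in the group (assumption).
        h_score = mean(r.judge_score for r in rows)
        summary[key] = (h_rate, h_score)
    return summary
```

Keeping the two values as a pair, instead of folding them into one number, is what lets a dissociation show up: a model can commit rarely but fabricate with high confidence when it does, or the reverse.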
In practice
- Use Ghost-100 to benchmark VLM robustness to linguistic coercion.
- Implement dual-track H-Rate and H-Score for nuanced VLM evaluation.
- Apply the automated validation workflow for scalable dataset auditing.
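When applying the items above, one simple check is whether a model's hallucination rate rises steadily with prompt intensity or peaks at an intermediate level. The sketch below is a minimal version of that check. The function names and the example rates are made up for illustration and are not results from the study.

```python
def peak_level(rates_by_level):
    """Return the prompt level with the highest rate (the lowest level wins ties)."""
    return max(sorted(rates_by_level), key=lambda lvl: rates_by_level[lvl])


def is_monotonic_non_decreasing(rates_by_level, tol=0.0):
    """True if the rate never drops by more than `tol` from one level to the next."""
    levels = sorted(rates_by_level)
    return all(
        rates_by_level[b] >= rates_by_level[a] - tol
        for a, b in zip(levels, levels[1:])
    )


# Illustrative, made-up per-level rates for a single model.
example_rates = {1: 0.10, 2: 0.25, 3: 0.40, 4: 0.30, 5: 0.28}
example_peak = peak_level(example_rates)                       # level 3
example_monotonic = is_monotonic_non_decreasing(example_rates)  # False
```

The `tol` parameter is there because small dips between adjacent levels may be noise. How large a dip counts as meaningful depends on the sample size per level and is left to the evaluator.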
Topics
- LLM-as-Judge Framework
- Vision-Language Models
- Hallucination Evaluation
- Ghost-100 Benchmark
- Prompt Intensity
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.