Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model
Summary
Large Vision-Language Models (LVLMs) frequently hallucinate, asserting visual details that the image does not support. Existing selective prediction methods offer distribution-free guarantees on hallucination rates, but at a high cost: they must abstain on over 80% of claims to keep the hallucination rate below 5% on object-existence benchmarks. To reduce this waste, Budgeted Conformal Evidence Acquisition (BCEA) replaces the binary answer/abstain choice with a three-way decision: answer, abstain, or acquire additional visual evidence through re-examination (zooming, cropping, claim-specific interventions) under a bounded compute budget. A critical observation is that naive evidence acquisition breaks the statistical guarantees of conformal calibration, causing realized risk to overshoot the target by up to 17 points. BCEA addresses this by folding the entire acquisition policy into the score function and recalibrating, which restores finite-sample guarantees and improves coverage. On the POPE and COCO benchmarks with four open VLMs, BCEA controls hallucination rates and consistently improves coverage over abstention-only baselines that carry guarantees.
Key takeaway
If you deploy Large Vision-Language Models and high abstention rates are the price of your hallucination guarantees, consider Budgeted Conformal Evidence Acquisition (BCEA). It lets the model acquire additional visual evidence, for example by zooming or cropping, within a compute budget, which improves coverage without giving up statistical reliability. Fold the acquisition policy directly into the score function and recalibrate on the resulting scores to restore finite-sample guarantees.
Key insights
BCEA improves LVLM reliability by acquiring additional visual evidence under a compute budget, and it restores statistical guarantees by recalibrating on post-acquisition scores.
Principles
- Naive evidence acquisition breaks conformal guarantees.
- Recalibrating post-acquisition restores statistical guarantees.
- Structured, claim-type-specific interventions are effective.
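The first two principles can be illustrated with a toy simulation. This is a minimal sketch on synthetic data, not the paper's experiment: the uniform scores, the `trigger` value, the "keep the larger score" aggregation rule, and all function names are assumptions made for illustration. It tracks the rate at which false claims get answered, using a standard split-conformal threshold.

```python
import math
import random

def conformal_threshold(scores, alpha):
    """Order-statistic threshold tau with the split-conformal guarantee
    P(new exchangeable score > tau) <= alpha."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]

def with_acquisition(score, rng, trigger=0.9):
    """Toy policy (assumption): re-examine once if the first score is below
    `trigger`, then keep the larger of the two confidence scores."""
    if score < trigger:
        return max(score, rng.random())
    return score

def simulate(alpha=0.05, n=20000, seed=0):
    rng = random.Random(seed)
    # Confidence scores assigned to FALSE claims (synthetic, uniform on [0, 1)).
    cal_initial = [rng.random() for _ in range(n)]
    test_initial = [rng.random() for _ in range(n)]
    cal_final = [with_acquisition(s, rng) for s in cal_initial]
    test_final = [with_acquisition(s, rng) for s in test_initial]

    # Naive: threshold calibrated on scores seen BEFORE acquisition.
    tau_naive = conformal_threshold(cal_initial, alpha)
    # Recalibrated: threshold calibrated on the output of the whole policy.
    tau_recal = conformal_threshold(cal_final, alpha)

    def answered_rate(tau):
        # Fraction of false test claims that would be answered after acquisition.
        return sum(s > tau for s in test_final) / n

    return answered_rate(tau_naive), answered_rate(tau_recal)

naive_rate, recal_rate = simulate()
print(f"naive: {naive_rate:.3f}  recalibrated: {recal_rate:.3f}  target: 0.050")
```

In this toy setup the naive rate lands well above the 5% target, because acquisition gives each false claim a second chance to clear a threshold that was calibrated for a single look. The recalibrated rate stays near the target.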
Method
BCEA replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence (zoom, crop, claim-specific intervention) within a budget. It then recalibrates on the post-acquisition scores.
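Why recalibrating on post-acquisition scores helps can be seen from the standard split-conformal bound, written here for the output of the whole policy. The notation ($s_\pi$, $\hat\tau$) and the choice to calibrate on false claims are assumptions for illustration. This summary does not give the paper's exact risk definition.

```latex
% s_\pi(x): confidence score for claim x after running acquisition policy \pi
% (notation assumed here, not taken from the paper).
Let $s_i = s_\pi(x_i)$, $i = 1, \dots, n$, be the post-acquisition scores of
$n$ calibration claims known to be false, and set
\[
  \hat\tau = s_{(k)}, \qquad k = \lceil (n+1)(1-\alpha) \rceil ,
\]
where $s_{(k)}$ is the $k$-th smallest score (take $\hat\tau = +\infty$ if $k > n$).
If a new false claim $x_{n+1}$ is exchangeable with the calibration claims, then
\[
  \Pr\bigl( s_\pi(x_{n+1}) > \hat\tau \bigr) \le \alpha .
\]
```

The bound relies on calibration and test scores coming from the same procedure. A threshold computed from pre-acquisition scores and then applied to post-acquisition scores compares two differently distributed quantities, so the bound no longer applies.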
In practice
- Implement a three-way decision (answer, abstain, acquire) for LVLM outputs.
- Integrate evidence acquisition into the score function, then recalibrate.
- Use claim-type-specific visual interventions.
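The steps above can be sketched as a small decision loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `decide` and `run_claim`, the `floor` cutoff for retrying, the claim-type keys, and the "keep the larger score" rule are all hypothetical. The threshold `tau` is assumed to have been calibrated on scores produced by this same loop.

```python
from enum import Enum
from typing import Callable, Dict, Tuple

class Decision(Enum):
    ANSWER = "answer"
    ABSTAIN = "abstain"
    ACQUIRE = "acquire"

def decide(score: float, tau: float, floor: float, budget_left: int) -> Decision:
    """Three-way choice: answer above tau, retry while budget remains and the
    score is not hopeless (>= floor), otherwise abstain."""
    if score > tau:
        return Decision.ANSWER
    if budget_left > 0 and score >= floor:
        return Decision.ACQUIRE
    return Decision.ABSTAIN

def run_claim(
    claim_type: str,
    initial_score: float,
    interventions: Dict[str, Callable[[], float]],
    tau: float,
    floor: float,
    budget: int,
) -> Tuple[Decision, float, int]:
    """Return (final decision, final score, number of acquisitions used)."""
    intervention = interventions.get(claim_type)
    score, used = initial_score, 0
    while True:
        # A claim type with no registered intervention gets no acquisition budget.
        budget_left = budget - used if intervention is not None else 0
        decision = decide(score, tau, floor, budget_left)
        if decision is not Decision.ACQUIRE:
            return decision, score, used
        # Hypothetical intervention: modify the view (zoom, crop, ...),
        # re-query the model, and return the new confidence score.
        score = max(score, intervention())  # aggregation rule is an assumption
        used += 1

# Stub interventions returning fixed scores, for illustration only.
stubs = {"existence": lambda: 0.93, "attribute": lambda: 0.40}
result_existence = run_claim("existence", 0.70, stubs, tau=0.90, floor=0.50, budget=2)
result_attribute = run_claim("attribute", 0.70, stubs, tau=0.90, floor=0.50, budget=2)
print(result_existence)
print(result_attribute)
```

With these stubs, the existence claim is answered after one re-examination, while the attribute claim exhausts its budget and abstains. Keying interventions by claim type keeps the policy a fixed, deterministic function of the input, which makes it straightforward to run unchanged on the calibration set.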
Topics
- Large Vision-Language Models
- Hallucination Control
- Conformal Prediction
- Evidence Acquisition
- Selective Prediction
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.