Why Vision Language Models Ignore What They See with Munawar Hayat - #758
Summary
Munawar Hayat, a researcher at Qualcomm AI Research, discussed several papers presented at NeurIPS 2025 focusing on multimodal and generative AI, particularly for efficient on-device deployment. Key challenges addressed include object hallucination in Vision-Language Models (VLMs), where models often prioritize language priors over visual information. His team developed an attention-guided alignment method to improve visual grounding and introduced a novel generalized contrastive learning approach for complex, composed retrieval tasks, such as combined text and image queries, without increasing inference costs. Additionally, they tackled difficulties in generative models rendering multiple human subjects, creating the "MultiHuman Testbench" to measure and mitigate issues like identity leakage and attribute blending. Qualcomm AI Research aims to advance AI capabilities for perception, reasoning, and action across devices, with a strong presence at NeurIPS, including 17 papers and nine demos.
Key takeaway
Computer vision engineers building multimodal AI for on-device deployment should prioritize techniques that enforce strong visual grounding and prevent object hallucination. Consider generalized contrastive learning for robust cross-modal retrieval, and explore attention-masking strategies to preserve identity in multi-person image generation, so that mobile AI experiences stay efficient and accurate.
Key insights
VLMs often ignore visual input and rely on language priors instead, so they need stronger visual grounding and multimodal alignment.
Principles
- Vision-language models require explicit visual grounding.
- Multimodal retrieval benefits from generalized contrastive learning.
- Multi-person generation needs identity preservation.
Method
Attention-guided alignment injects visual tokens hierarchically and trains with an auxiliary loss. Generalized contrastive learning reformulates the contrastive loss over combinations of image, text, and fused embeddings. The MultiHuman Testbench work uses attention masks to prevent identity leakage.
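The idea of a contrastive loss over combinations of image, text, and fused embeddings can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the formulation from the paper: it assumes each sample has one embedding per type, that embeddings at the same index are positives, and that a standard InfoNCE term is simply averaged over every ordered pair of types. The function names, the dictionary layout, and the temperature value are all hypothetical.

```python
import math
from itertools import permutations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, candidates, temperature=0.07):
    """Mean InfoNCE loss, where candidates[i] is the positive for queries[i]."""
    q = [normalize(x) for x in queries]
    c = [normalize(x) for x in candidates]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [dot(q_i, c_j) / temperature for c_j in c]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def generalized_contrastive_loss(embeddings, temperature=0.07):
    """Average InfoNCE over every ordered pair of embedding types.

    `embeddings` maps a type name ("image", "text", "fused") to a list of
    vectors aligned by sample index. Averaging over all ordered pairs is an
    assumption made for this sketch.
    """
    pairs = list(permutations(embeddings, 2))
    losses = [info_nce(embeddings[a], embeddings[b], temperature) for a, b in pairs]
    return sum(losses) / len(losses)


# Toy batch of two samples with 2-D embeddings.
aligned = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "fused": [[1.0, 0.0], [0.0, 1.0]],
}
# Same batch with the text embeddings swapped, so text no longer matches.
misaligned = dict(aligned, text=[[0.0, 1.0], [1.0, 0.0]])

loss_aligned = generalized_contrastive_loss(aligned)
loss_misaligned = generalized_contrastive_loss(misaligned)
```

Because every pairing is trained, a query in any form (image only, text only, or image plus text) can be matched against candidates in any form with the same encoder outputs, which is consistent with the summary's point that inference cost does not increase.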
In practice
- Use attention-guided alignment to reduce VLM hallucination.
- Apply generalized contrastive learning for complex cross-modal search.
- Employ attention masks for multi-person image generation.
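The attention-mask idea for multi-person generation can be illustrated with a small sketch. It assumes, hypothetically, that each image token is assigned to at most one person region and that each reference token belongs to one person's reference image; the mask then lets an image token attend only to reference tokens of its own person. The function names and the region assignment are illustrative and are not taken from the paper.

```python
import math


def build_identity_mask(token_person, reference_person):
    """mask[i][j] is True when image token i may attend to reference token j.

    `token_person[i]` is the person index for image token i, or None for
    background. `reference_person[j]` is the person index of reference token j.
    """
    return [
        [tp is not None and tp == rp for rp in reference_person]
        for tp in token_person
    ]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over the allowed positions only."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        if not allowed:
            # Background tokens attend to no identity reference at all.
            weights.append([0.0] * len(score_row))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three image tokens: person 0, person 1, and background.
token_person = [0, 1, None]
# Three reference tokens: two from person 0, one from person 1.
reference_person = [0, 0, 1]

scores = [
    [1.0, 1.0, 5.0],  # the high score for person 1's reference is masked out
    [2.0, 0.0, 1.0],
    [3.0, 3.0, 3.0],
]
mask = build_identity_mask(token_person, reference_person)
weights = masked_attention_weights(scores, mask)
```

In the first row the largest raw score belongs to the other person's reference token; masking it out is what keeps one subject's features from leaking into another's region.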
Topics
- Multimodal AI
- Generative AI
- Vision-Language Models
- Contrastive Learning
- On-Device AI
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence).