Why Vision Language Models Ignore What They See [Munawar Hayat] - 758
Summary
Qualcomm AI Research presented 17 papers and 9 demos at the recent NeurIPS conference, focusing on multimodal generative AI, visual understanding, and cross-modal retrieval. Key research areas include fixing physical property inconsistencies in physics-based image generation, reducing hallucination in Vision-Language Models (VLMs) through attention-guided alignment, and improving multimodal retrieval with Generalized Contrastive Learning (GCL). The team also introduced a Multi-Human Test Bench, a benchmark for multi-person image generation that targets facial identity preservation and count accuracy. Qualcomm's work emphasizes efficient on-device AI, with demonstrations that include mobile diffusion transformers generating 48 frames in under 8 seconds on a mobile phone and single-step diffusion-based image editing.
Key takeaway
AI scientists and research scientists developing multimodal AI should recognize that current VLM limitations in physics-based generation and visual grounding call for new approaches beyond simply scaling data. Focus on integrating visual information more deeply into language models through techniques such as attention-guided alignment and specialized loss functions, and consider the proposed Multi-Human Test Bench when evaluating multi-person image generation for identity preservation and count accuracy.
Key insights
Current VLMs struggle with physics-based generation, visual grounding, and multi-person image fidelity, necessitating improved training and architectural approaches.
Principles
- Visual information often gets ignored in VLMs.
- Physics-based understanding is crucial for real-world AI.
- Attention mechanisms can enhance visual grounding.
Method
Qualcomm's "Attention Guided Alignment" injects visual tokens at hierarchical levels of the language model through cross-attention modules. An auxiliary loss based on segmentation masks maximizes attention to salient visual regions, which improves VLM grounding.
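The auxiliary loss idea can be illustrated with a minimal sketch. The summary does not give the exact formulation, so the form below (negative log of the attention mass that lands on masked patches) is an assumption, and the name `attention_alignment_loss` is hypothetical. It works on a single attention row over image patches and uses only the Python standard library.

```python
import math

def attention_alignment_loss(attn_logits, mask, eps=1e-8):
    """Hypothetical auxiliary loss: -log(attention mass on salient patches).

    attn_logits: raw attention scores, one per image patch.
    mask: 0/1 flags, 1 where the patch lies inside the segmentation mask.
    The exact loss used in the original work is not stated in the summary.
    """
    if len(attn_logits) != len(mask):
        raise ValueError("attn_logits and mask must have the same length")

    # Numerically stable softmax over the patch dimension.
    peak = max(attn_logits)
    exps = [math.exp(score - peak) for score in attn_logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Fraction of attention that falls on salient (masked) patches.
    on_mask = sum(w for w, keep in zip(weights, mask) if keep)

    # Minimizing this value pushes attention mass toward the masked region.
    return -math.log(on_mask + eps)


mask = [1, 1, 0, 0]
grounded = attention_alignment_loss([5.0, 5.0, 0.0, 0.0], mask)
ungrounded = attention_alignment_loss([0.0, 0.0, 5.0, 5.0], mask)
print(grounded < ungrounded)  # attention on the mask gives the lower loss
```

In a full model this term would presumably be added to the language modeling loss with a weighting factor, which is likewise not specified in the summary.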
In practice
- Expand image descriptions with physics information during VLM training.
- Use cross-attention for efficient visual token injection.
- Employ attention masks to prevent identity leakage in multi-person generation.
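The last point, masking attention to limit identity leakage, can be sketched as follows. The summary does not describe how the masks are built, so this is an assumption: each image token is tagged with the person whose region it belongs to, each reference (identity) token is tagged with the person it describes, and attention is allowed only when the two match. The names `build_identity_mask` and `masked_softmax` are hypothetical.

```python
import math

def build_identity_mask(region_of_token, person_of_ref):
    """allowed[i][j] is True when image token i may attend to reference token j.

    region_of_token: person index per image token, or None for background.
    person_of_ref: person index per reference (identity) token.
    """
    return [
        [region is not None and region == ref for ref in person_of_ref]
        for region in region_of_token
    ]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; blocked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        # Fully blocked row (e.g. background): no identity information flows in.
        return [0.0] * len(scores)
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Three image tokens (two in person 0's region, one in person 1's) plus background.
mask = build_identity_mask([0, 0, 1, None], [0, 1])
# Even with a much higher raw score for person 1's reference token,
# a token in person 0's region attends only to person 0's reference.
print(masked_softmax([2.0, 9.0], mask[0]))
```

In this sketch background tokens receive no identity signal at all. A real system would likely still let them attend to text or other conditioning tokens, which are omitted here.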
Topics
- Multimodal AI
- Vision-Language Models
- Physics-Based Generation
- Contrastive Learning
- AI Efficiency
Best for: AI Scientist, Research Scientist, AI Researcher, Deep Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The TWIML AI Podcast with Sam Charrington.