Why Vision Language Models Ignore What They See [Munawar Hayat] - 758

Source: The TWIML AI Podcast with Sam Charrington · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing) · Depth: Advanced, extended

Summary

Qualcomm AI Research presented 17 papers and 9 demos at the recent NeurIPS conference, focusing on multimodal generative AI, visual understanding, and cross-modal retrieval. Key research areas include addressing inconsistent physical properties in physics-based image generation, reducing hallucination in Vision-Language Models (VLMs) through attention-guided alignment, and improving multimodal retrieval with Generalized Contrastive Learning (GCL). The team also introduced a Multi-Human Test Bench, a benchmark for multi-person image generation that evaluates whether facial identity and person count are preserved. Qualcomm's work emphasizes efficient on-device AI, with demonstrations that include a mobile diffusion transformer generating 48 frames in under 8 seconds on a mobile phone and single-step diffusion-based image editing.

Key takeaway

AI scientists and research scientists developing multimodal AI should recognize that current VLM limitations in physics-based generation and visual grounding call for new approaches, not just more data. Two directions stand out: integrate visual information more deeply into the language model through techniques such as attention-guided alignment and specialized loss functions, and use the proposed Multi-Human Test Bench to evaluate identity preservation and count accuracy in multi-person image generation.

Key insights

Current VLMs struggle with physics-based generation, visual grounding, and multi-person image fidelity, necessitating improved training and architectural approaches.

Method

Qualcomm's "Attention Guided Alignment" injects visual tokens at hierarchical levels of the language model via cross-attention modules. It also adds an auxiliary loss, based on segmentation masks, that maximizes attention to salient visual regions and thereby improves VLM grounding.
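The two ingredients named above can be illustrated with a minimal NumPy sketch. It is not Qualcomm's implementation: the function names, the single-head attention, the residual injection, and the exact loss form (negative log of the attention mass that falls on masked patches) are all assumptions made for illustration. The source states only that visual tokens enter through cross-attention and that a segmentation-mask-based auxiliary loss maximizes attention to salient regions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vis_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: text states query the visual tokens.

    text_h: (T, d) text hidden states; vis_tokens: (P, d) patch tokens.
    Returns the updated text states and the (T, P) attention map.
    """
    q = text_h @ w_q
    k = vis_tokens @ w_k
    v = vis_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual injection of visual information (an assumed design choice).
    return text_h + attn @ v, attn

def mask_attention_loss(attn, patch_mask, eps=1e-8):
    """Hypothetical auxiliary loss: negative log of the attention mass on
    patches covered by the segmentation mask, averaged over text tokens.

    patch_mask: (P,) array of 0/1 values, the mask downsampled to the patch grid.
    """
    mass = (attn * patch_mask[None, :]).sum(axis=-1)
    return float(-np.log(mass + eps).mean())

# Toy example: 3 text tokens, 16 image patches, hidden size 8.
rng = np.random.default_rng(0)
T, P, d = 3, 16, 8
text_h = rng.normal(size=(T, d))
vis_tokens = rng.normal(size=(P, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

patch_mask = np.zeros(P)
patch_mask[:4] = 1.0  # pretend the salient object covers the first 4 patches

updated, attn = cross_attention(text_h, vis_tokens, w_q, w_k, w_v)
aux_loss = mask_attention_loss(attn, patch_mask)
print(updated.shape, attn.shape, aux_loss)
```

In a hierarchical setup, a module like `cross_attention` would be applied at several depths of the language model, and the auxiliary term would be added to the usual language-modeling loss with some weighting. Both the placement and the weighting are left open here because the summary does not specify them.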

Best for: AI Scientist, Research Scientist, AI Researcher, Deep Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The TWIML AI Podcast with Sam Charrington.