To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision, Speech Processing) · Depth: Expert, quick

Summary

This work proposes a query-adaptive framework for audio-visual person retrieval that addresses absent modalities in real-world video archives. Unlike curated benchmarks, broadcast content often features people who are heard but not seen, or seen but not heard. In those cases fixed multimodal fusion injects noise from the absent modality and can push precision below the stronger unimodal system: in the evaluation, fixed fusion (90.0% P@1) trails face-only retrieval (93.4%). The framework detects which modalities are active from cross-modal score consistency: high agreement between the face and speaker scores indicates that both are active, while disagreement signals that one is absent. Classifiers built on these cross-modal features reach 89% detection accuracy. On the BBC Rewind corpus (over 12,000 videos), the adaptive system attains 94.2% P@1, ahead of speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%). That recovers 64% of the gap between fixed fusion and an oracle (96.6%), or 4.2 of 6.6 points.
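The idea of cross-modal score consistency can be made concrete with a minimal sketch. The summary does not say which consistency features the authors use, so the two measures here (top-k overlap and Pearson correlation of the score lists) are illustrative assumptions, as are the function names and the toy scores.

```python
from statistics import mean


def top_k_overlap(face_scores, voice_scores, k=5):
    """Fraction of the top-k candidates that the two modalities share."""
    top_face = set(sorted(face_scores, key=face_scores.get, reverse=True)[:k])
    top_voice = set(sorted(voice_scores, key=voice_scores.get, reverse=True)[:k])
    return len(top_face & top_voice) / k


def score_correlation(face_scores, voice_scores):
    """Pearson correlation of the two score lists over shared candidates."""
    ids = sorted(set(face_scores) & set(voice_scores))
    if len(ids) < 2:
        return 0.0
    xs = [face_scores[i] for i in ids]
    ys = [voice_scores[i] for i in ids]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores carry no agreement signal
    return cov / (var_x * var_y) ** 0.5


# Hypothetical similarity scores of one query against four candidates.
face = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2}
voice_agrees = {"a": 0.85, "b": 0.7, "c": 0.2, "d": 0.1}
voice_disagrees = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.8}

print(top_k_overlap(face, voice_agrees, k=2), score_correlation(face, voice_agrees))
print(top_k_overlap(face, voice_disagrees, k=2), score_correlation(face, voice_disagrees))
```

With the agreeing voice scores both measures are high, which under this sketch would suggest that both modalities are active. With the disagreeing scores both measures drop, hinting that one modality is absent.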

Key takeaway

Machine learning engineers building person retrieval systems for noisy, real-world video archives should avoid fixed multimodal fusion. Instead, detect for each query which modalities are active and rely only on those. Using cross-modal score consistency for this detection raised P@1 on broadcast data from 90.0% (fixed fusion) to 94.2%, also ahead of the face-only (93.4%) and speaker-only (82.9%) systems.
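A minimal sketch of what query-adaptive scoring could look like once the active modalities are known. The equal-weight average used when both modalities are active, the function name, and the toy clips are assumptions for illustration. The summary does not describe the authors' fusion rule.

```python
def rank_candidates(candidates, face_active, voice_active, face_weight=0.5):
    """Rank candidate ids using only the modalities flagged as active.

    candidates maps a candidate id to a (face_score, voice_score) pair.
    """
    if not (face_active or voice_active):
        raise ValueError("at least one modality must be active")

    def score(pair):
        face_score, voice_score = pair
        if face_active and voice_active:
            # Assumed fusion rule: weighted average of the two scores.
            return face_weight * face_score + (1 - face_weight) * voice_score
        return face_score if face_active else voice_score

    return sorted(candidates, key=lambda cid: score(candidates[cid]), reverse=True)


# Hypothetical case: the person is seen but not heard, so voice scores are noise.
clips = {"clip1": (0.9, 0.1), "clip2": (0.6, 0.8)}

print(rank_candidates(clips, face_active=True, voice_active=True))   # fixed fusion
print(rank_candidates(clips, face_active=True, voice_active=False))  # adaptive
```

In this toy case fixed fusion ranks clip2 first because the uninformative voice score drags clip1 down, while the adaptive call ignores the voice score and ranks clip1 first.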

Key insights

Query-adaptive modality detection improves person retrieval by keeping absent modalities from injecting noise into the retrieval scores.

Principles

Method

Classifiers driven by cross-modal features detect active modalities, achieving 89% accuracy.
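As a rough illustration of such a classifier, the sketch below fits a single threshold on one consistency feature. This is a deliberately simplified stand-in. The summary does not state the classifier type or feature set, and the training values, labels, and function names are hypothetical.

```python
def fit_threshold(features, labels):
    """Choose the cut on a 1-D consistency feature that maximises training accuracy.

    labels[i] is True when both modalities are active for query i.
    Returns (threshold, training_accuracy).
    """
    values = sorted(set(features))
    # Candidate cuts sit midway between neighbouring feature values.
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best_cut, best_acc = cuts[0], -1.0
    for cut in cuts:
        correct = sum((f >= cut) == y for f, y in zip(features, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc


def detect_both_active(feature, threshold):
    """Predict that both modalities are active when consistency is high enough."""
    return feature >= threshold


# Hypothetical training data: a consistency feature per query and its label.
train_features = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]
train_labels = [True, True, True, False, False, False]

threshold, train_acc = fit_threshold(train_features, train_labels)
print(threshold, train_acc, detect_both_active(0.85, threshold))
```

A practical detector would likely combine several features and also decide which modality is missing, which this two-class sketch does not attempt.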

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.