To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection
Summary
This paper proposes a query-adaptive framework for audio-visual person retrieval that addresses absent modalities in real-world video archives. Unlike curated benchmarks, broadcast content often features individuals who are heard but unseen, or seen but unheard. In such cases fixed multimodal fusion injects noise from the absent modality, and precision can fall below that of the best unimodal system. The framework detects active modalities using cross-modal score consistency: high agreement between modalities indicates both are active, while disagreement signals an absent modality. Classifiers using these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (over 12,000 videos), the adaptive system attained 94.2% P@1, surpassing speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap between fixed fusion and an oracle (96.6%).
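The 64% figure can be checked directly from the P@1 values quoted in the summary. The short sketch below computes the fraction of the baseline-to-oracle gap that the adaptive system closes for each baseline; the function name `gap_recovered` is ours, not the paper's. Only the fixed-fusion baseline rounds to 64%, which suggests that is the gap being reported.

```python
def gap_recovered(adaptive: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap closed by the adaptive system."""
    return (adaptive - baseline) / (oracle - baseline)


# P@1 values (percent) as quoted in the summary.
BASELINES = {"speaker_only": 82.9, "face_only": 93.4, "fixed_fusion": 90.0}
ADAPTIVE, ORACLE = 94.2, 96.6

recovered = {
    name: gap_recovered(ADAPTIVE, base, ORACLE) for name, base in BASELINES.items()
}

for name, frac in recovered.items():
    print(f"{name}: {frac:.0%} of the gap to the oracle recovered")
```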
Key takeaway
Machine learning engineers building person retrieval systems for noisy, real-world video archives should avoid fixed multimodal fusion. Instead, detect per query which modalities are active and fuse only those. On broadcast data, this approach, based on cross-modal score consistency, outperformed both unimodal systems and fixed fusion (94.2% P@1 versus 93.4% for face-only and 90.0% for fixed fusion).
Key insights
Query-adaptive modality detection improves person retrieval by keeping noise from absent modalities out of the fused score.
Principles
- Fusing scores from an absent modality degrades retrieval precision.
- Cross-modal score consistency indicates active modality presence.
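One way to make "cross-modal score consistency" concrete is to compare the face and voice score vectors that a single query produces over the same gallery. The sketch below is an assumption about what such features could look like, not the paper's feature set: the feature names (`top1_agree`, `topk_jaccard`, `pearson`), the choice of `k`, and the example scores are all hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring gallery identities."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def consistency_features(
    face: Sequence[float], voice: Sequence[float], k: int = 2
) -> Dict[str, float]:
    """Hypothetical agreement features between two modality score vectors."""
    top_f, top_v = top_k(face, k), top_k(voice, k)
    return {
        "top1_agree": float(top_k(face, 1) == top_k(voice, 1)),
        "topk_jaccard": len(top_f & top_v) / len(top_f | top_v),
        "pearson": pearson(face, voice),
    }


# Illustrative scores over a 4-identity gallery (invented values).
face_scores = [0.91, 0.20, 0.15, 0.10]
voice_when_speaking = [0.85, 0.30, 0.10, 0.22]  # same person is heard
voice_when_silent = [0.21, 0.26, 0.24, 0.25]    # person is seen but not heard

agree = consistency_features(face_scores, voice_when_speaking)
disagree = consistency_features(face_scores, voice_when_silent)
print(agree)
print(disagree)
```

When both modalities capture the same person, the two rankings tend to line up and all three features are high; when one modality is absent, its scores are effectively noise and agreement drops.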
Method
Classifiers trained on cross-modal features detect which modalities are active, achieving 89% detection accuracy.
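The summary does not say which classifier family is used, so the following is only a stand-in: a one-feature decision stump that learns a threshold on a single consistency value and predicts whether both modalities are active. The training values are toy numbers invented for illustration, and a real detector would also need to decide which modality is missing, which this binary sketch does not attempt.

```python
from typing import Sequence, Tuple


def fit_stump(values: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Pick the threshold on a 1-D feature that maximizes training accuracy.

    The stump predicts "both modalities active" when value >= threshold.
    Returns (threshold, training_accuracy).
    """
    candidates = sorted(set(values))
    best_thr, best_acc = candidates[0], -1.0
    for thr in candidates:
        correct = sum((v >= thr) == y for v, y in zip(values, labels))
        acc = correct / len(values)
        if acc > best_acc:  # strict: keep the lowest threshold among ties
            best_thr, best_acc = thr, acc
    return best_thr, best_acc


# Toy training set (invented): one consistency value per query,
# with a label saying whether both face and voice were really present.
consistency = [0.95, 0.88, 0.91, 0.12, -0.05, 0.30, 0.76, 0.40]
both_active = [True, True, True, False, False, False, True, False]

threshold, train_acc = fit_stump(consistency, both_active)


def predict_both_active(value: float, thr: float = threshold) -> bool:
    """Apply the learned threshold to a new query's consistency value."""
    return value >= thr


print(f"threshold={threshold}, training accuracy={train_acc:.0%}")
```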
In practice
- Implement cross-modal score consistency checks for robust multimodal retrieval.
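- Prefer the stronger unimodal system over fixed fusion when modality presence is uncertain; on BBC Rewind, face-only (93.4% P@1) beat fixed fusion (90.0%), while speaker-only (82.9%) did not.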
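The practical difference between fixed and query-adaptive fusion can be sketched as follows. This is a minimal illustration under stated assumptions: scores from both modalities are taken to be on a comparable scale, fusion is a plain average, and the set of active modalities is supplied by some upstream detector. The helper names `fuse` and `top1` and the example scores are hypothetical.

```python
from typing import Collection, List, Sequence


def fuse(
    face: Sequence[float],
    voice: Sequence[float],
    active: Collection[str] = ("face", "voice"),
) -> List[float]:
    """Average per-identity scores over the modalities marked as active."""
    streams = []
    if "face" in active:
        streams.append(face)
    if "voice" in active:
        streams.append(voice)
    if not streams:
        raise ValueError("at least one modality must be active")
    return [sum(vals) / len(streams) for vals in zip(*streams)]


def top1(scores: Sequence[float]) -> int:
    """Index of the best-scoring gallery identity."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Query where the person is seen but not heard (invented scores).
face_scores = [0.60, 0.55, 0.10]   # identity 0 is the correct match
voice_scores = [0.10, 0.40, 0.20]  # noise: the target never speaks

fixed_choice = top1(fuse(face_scores, voice_scores))
adaptive_choice = top1(fuse(face_scores, voice_scores, active={"face"}))
print(f"fixed fusion picks {fixed_choice}, adaptive fusion picks {adaptive_choice}")
```

In this example the uninformative voice scores pull fixed fusion toward the wrong identity, while restricting fusion to the detected active modality keeps the face-based match.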
Topics
- Audio-Visual Person Retrieval
- Multimodal Fusion
- Active Modality Detection
- Cross-modal Consistency
- Video Archives
- BBC Rewind
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.