To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

The paper addresses multimodal person retrieval in uncurated video archives, where a target person may be seen, heard, or both. Unlike curated benchmarks, real-world collections such as the BBC Rewind corpus (12,594 videos) often contain audio-only presence (AoP) or visual-only presence (VoP) alongside audio-visual presence (AVP). The framework detects which modalities are active using cross-modal score consistency: agreement between the retrieval sets returned by the two modalities indicates which modalities are active. Classifiers, including decision trees, achieve 89% detection accuracy on a query set of 523 video files covering 38 politicians. The adaptive system attains 94.2% P@1 on BBC Rewind, outperforming speaker-only (82.9%), face-only (93.4%), and fixed-fusion (90.0%) retrieval. It recovers 64% of the performance gap to an oracle that uses ground-truth modality labels (96.6% P@1).
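
The summary does not say which baseline the 64% gap recovery is measured from. The short check below, a sketch using only the P@1 figures quoted above, computes the recovered fraction against each baseline; `gap_recovered` and the dictionary keys are illustrative names, not taken from the paper.

```python
# P@1 figures (percent) as quoted in the summary above.
P_AT_1 = {
    "speaker_only": 82.9,
    "face_only": 93.4,
    "fixed_fusion": 90.0,
    "adaptive": 94.2,
    "oracle": 96.6,
}


def gap_recovered(baseline: float, system: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap that the system closes."""
    return (system - baseline) / (oracle - baseline)


if __name__ == "__main__":
    for name in ("speaker_only", "face_only", "fixed_fusion"):
        frac = gap_recovered(P_AT_1[name], P_AT_1["adaptive"], P_AT_1["oracle"])
        print(f"gap recovered vs {name}: {frac:.0%}")
```

Of the three baselines, only fixed fusion reproduces the quoted figure: (94.2 - 90.0) / (96.6 - 90.0) is about 0.64, which suggests the gap is measured from the fixed-fusion system.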

Key takeaway

Machine learning engineers building person retrieval systems for uncurated video archives should implement query-adaptive modality detection. Blindly fusing audio and visual scores can push P@1 below the best unimodal system: fixed fusion reaches 90.0%, versus 93.4% for face-only retrieval. Using cross-modal score consistency to choose the fusion for each query yields 94.2% P@1 and recovers 64% of the gap to an oracle.

Key insights

Cross-modal score consistency reliably detects active modalities, preventing noise from absent modalities in multimodal retrieval.

Principles

Method

Detect active modalities by analyzing within-modal and cross-modal cosine similarity score distributions from top-n retrieved files. Classify query type (AoP, VoP, AVP) using these features, then adapt fusion weights accordingly.
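
The method above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the feature set (top-n set overlap plus mean scores), the threshold rule standing in for the trained classifier, the `overlap_threshold` value, and the fusion weights are all assumptions chosen for readability. It also assumes speaker and face cosine scores are comparable in scale, which a real system would need to calibrate.

```python
from statistics import mean

# Hypothetical fusion weights (voice, face) per detected query type.
FUSION_WEIGHTS = {"AVP": (0.5, 0.5), "AoP": (1.0, 0.0), "VoP": (0.0, 1.0)}


def consistency_features(voice_hits, face_hits):
    """Build features from top-n hits, each a list of (file_id, cosine_score)."""
    voice_ids = {fid for fid, _ in voice_hits}
    face_ids = {fid for fid, _ in face_hits}
    union = voice_ids | face_ids
    return {
        # Cross-modal agreement: Jaccard overlap of the two retrieval sets.
        "overlap": len(voice_ids & face_ids) / len(union) if union else 0.0,
        # Within-modal score statistics.
        "voice_mean": mean(s for _, s in voice_hits) if voice_hits else 0.0,
        "face_mean": mean(s for _, s in face_hits) if face_hits else 0.0,
    }


def classify_query(features, overlap_threshold=0.3):
    """Toy rule standing in for a trained classifier such as a decision tree."""
    if features["overlap"] >= overlap_threshold:
        return "AVP"  # both modalities retrieve the same files
    # Otherwise trust the modality with the stronger top-n scores.
    return "AoP" if features["voice_mean"] >= features["face_mean"] else "VoP"


def fused_ranking(voice_scores, face_scores, label):
    """Rank file ids by the weighted sum of per-modality scores (dicts)."""
    w_voice, w_face = FUSION_WEIGHTS[label]
    files = set(voice_scores) | set(face_scores)
    fused = {
        f: w_voice * voice_scores.get(f, 0.0) + w_face * face_scores.get(f, 0.0)
        for f in files
    }
    return sorted(fused, key=fused.get, reverse=True)


if __name__ == "__main__":
    voice_hits = [("a", 0.82), ("b", 0.79), ("c", 0.75)]
    face_hits = [("x", 0.41), ("y", 0.40), ("z", 0.38)]
    label = classify_query(consistency_features(voice_hits, face_hits))
    print(label, fused_ranking(dict(voice_hits), dict(face_hits), label)[:3])
```

In the usage example the two retrieval sets share no files and the voice scores are much stronger, so the face scores are dropped from the fused ranking rather than allowed to add noise. Swapping `classify_query` for a classifier trained on labelled queries would bring the sketch closer to the approach the summary describes.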

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.