To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

The "Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection" framework addresses multimodal person retrieval in uncurated video archives, where a target person may be seen, heard, or both. Unlike curated benchmarks, real-world data such as the BBC Rewind corpus (12,594 videos) often contains audio-only presence (AoP) or visual-only presence (VoP) alongside audio-visual presence (AVP). The framework detects which modalities are active from cross-modal score consistency: when the retrieval sets returned by the two modalities agree, both modalities are taken to be active. Classifiers, including decision trees, achieve 89% detection accuracy on a query set of 523 video files covering 38 politicians. The adaptive system attains 94.2% P@1 on BBC Rewind, ahead of speaker-only (82.9%) and fixed fusion (90.0%) and, by a narrower 0.8 points, face-only (93.4%). It closes 64% of the gap between fixed fusion and an oracle with ground-truth modality labels (96.6% P@1).
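The 64% figure can be checked against the P@1 numbers quoted above. The following is a minimal sketch, assuming the gap is measured from the fixed-fusion baseline, which is the choice that reproduces the figure from these numbers. The function name `gap_recovered` is illustrative and not taken from the paper.

```python
def gap_recovered(system: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap that `system` closes."""
    return (system - baseline) / (oracle - baseline)

# P@1 figures quoted in the summary, in percent.
ADAPTIVE, FIXED_FUSION, FACE_ONLY, ORACLE = 94.2, 90.0, 93.4, 96.6

# Assumption: the baseline is fixed fusion (4.2 of 6.6 points).
vs_fixed_fusion = gap_recovered(ADAPTIVE, FIXED_FUSION, ORACLE)
# For comparison: the same ratio measured from face-only (0.8 of 3.2 points).
vs_face_only = gap_recovered(ADAPTIVE, FACE_ONLY, ORACLE)

print(f"{vs_fixed_fusion:.0%} of the gap from fixed fusion")  # 64%
print(f"{vs_face_only:.0%} of the gap from face-only")        # 25%
```

Measured from the stronger face-only baseline, the same numbers give 25%, so the choice of baseline matters when comparing this figure with other systems.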

Key takeaway

If you build person retrieval systems for uncurated video archives, implement query-adaptive modality detection. Blindly fusing audio and visual scores can push P@1 below a unimodal system: fixed fusion reaches 90.0%, against 93.4% for face-only. Choosing the fusion per query from cross-modal score consistency, the reported system reaches 94.2% P@1 and closes 64% of the gap between fixed fusion and the oracle (96.6%).

Key insights

Cross-modal score consistency detects active modalities with 89% accuracy, which keeps scores from an absent modality out of the fusion.

Principles

Fusing scores from an absent modality degrades precision.
Active modalities produce peaked score distributions.
Cross-modal agreement signals active modalities.
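The last two principles can be turned into simple numeric features. The sketch below is one possible reading, not the paper's feature set: `peak_margin` (top score minus the mean of the remaining top-n scores) stands in for "peakedness", and `agreement` (Jaccard overlap of the two top-n retrieval sets) stands in for cross-modal agreement. Both function names and the toy scores are made up for illustration.

```python
from statistics import mean

def top_n_ids(scores: dict[str, float], n: int) -> list[str]:
    """File IDs of the n highest-scoring gallery files."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def peak_margin(scores: dict[str, float], n: int) -> float:
    """Top-1 score minus the mean of the other top-n scores (needs n >= 2)."""
    top = sorted(scores.values(), reverse=True)[:n]
    return top[0] - mean(top[1:])

def agreement(face: dict[str, float], voice: dict[str, float], n: int) -> float:
    """Jaccard overlap between the face and voice top-n retrieval sets."""
    a, b = set(top_n_ids(face, n)), set(top_n_ids(voice, n))
    return len(a & b) / len(a | b)

# Toy cosine scores for a query where the person is seen but not heard.
face_scores = {"v1": 0.82, "v2": 0.31, "v3": 0.28, "v4": 0.25, "v5": 0.22}
voice_scores = {"v1": 0.41, "v2": 0.44, "v3": 0.39, "v4": 0.43, "v5": 0.40}

print(peak_margin(face_scores, 3))    # large margin: face looks active
print(peak_margin(voice_scores, 3))   # small margin: voice looks absent
print(agreement(face_scores, voice_scores, 2))  # low overlap between modalities
```

In this toy case the face scores are peaked, the voice scores are flat, and the two retrieval sets barely overlap, which is the pattern expected for a visual-only query.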

Method

Detect active modalities by analyzing within-modal and cross-modal cosine similarity score distributions from top-n retrieved files. Classify query type (AoP, VoP, AVP) using these features, then adapt fusion weights accordingly.
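The adaptation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the query type is taken as already detected, the per-type weights in `FUSION_WEIGHTS` are hypothetical (the summary does not give the paper's values), and the embeddings are toy 2-D vectors rather than real face or speaker embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical (face, voice) fusion weights per detected query type.
FUSION_WEIGHTS = {"AVP": (0.5, 0.5), "VoP": (1.0, 0.0), "AoP": (0.0, 1.0)}

def fused_scores(query, gallery, query_type):
    """Score every gallery file with weights chosen by the query type."""
    w_face, w_voice = FUSION_WEIGHTS[query_type]
    return {
        file_id: w_face * cosine(query["face"], emb["face"])
        + w_voice * cosine(query["voice"], emb["voice"])
        for file_id, emb in gallery.items()
    }

def rank(scores):
    """Gallery file IDs ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

# Visual-only query: the face is the target, the audio is someone else talking.
query = {"face": [1.0, 0.0], "voice": [0.0, 1.0]}
gallery = {
    "target_clip": {"face": [0.9, 0.1], "voice": [1.0, 0.0]},
    "narrator_clip": {"face": [0.0, 1.0], "voice": [0.0, 1.0]},
}

adaptive = rank(fused_scores(query, gallery, "VoP"))  # voice weight set to 0
fixed = rank(fused_scores(query, gallery, "AVP"))     # equal-weight fusion
print(adaptive[0], fixed[0])
```

With the voice weight set to zero the target clip ranks first, while equal-weight fusion lets the unrelated voice match pull the wrong clip to the top.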

In practice

Implement cross-modal score consistency for robust fusion.
Use decision trees for modality detection; they are among the classifiers reaching 89% accuracy.
Prioritize face embeddings for noisy archival video (face-only 93.4% vs. speaker-only 82.9% P@1).
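To make the decision-tree recommendation concrete, here is a small pure-Python sketch of a Gini-based tree fitted on two score features. It is not the paper's classifier or feature set: the features (a face margin and a voice margin per query), the six training rows, and the depth limit are all invented for illustration. In practice a library implementation would normally be used instead.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """(feature, threshold) with the lowest weighted Gini, or None if no gain."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint, so both sides are non-empty
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def fit(rows, labels, depth=2):
    """Nested tuples (feature, threshold, left, right); leaves are labels."""
    split = best_split(rows, labels) if depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority label
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            fit([r for r, _ in left], [y for _, y in left], depth - 1),
            fit([r for r, _ in right], [y for _, y in right], depth - 1))

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Invented training rows: (face_margin, voice_margin) per query.
X = [(0.50, 0.02), (0.45, 0.04), (0.03, 0.48),
     (0.05, 0.52), (0.47, 0.44), (0.52, 0.50)]
y = ["VoP", "VoP", "AoP", "AoP", "AVP", "AVP"]

tree = fit(X, y, depth=2)
print(predict(tree, (0.48, 0.03)))  # peaked face scores only
print(predict(tree, (0.50, 0.46)))  # both modalities peaked
```

A shallow tree like this yields readable threshold rules per feature, which makes it easy to inspect why a query was routed to a given fusion setting.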

Topics

Multimodal Retrieval
Active Modality Detection
Audio-Visual Fusion
Speaker Embeddings
Face Embeddings
BBC Rewind Corpus

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.