MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion
Summary
MMTM introduces a modular pipeline for topic discovery in long-form video, integrating speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality. Noise drops from 0.27 to 0.06 and transition rate from 0.70 to 0.21, while normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves by a factor of 5 to 12 across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on German but is corpus-dependent and does not transfer to the shorter NBC broadcasts. The pipeline code and a human-validated 54-hour multimodal video topic corpus are released.
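The three sequence-level metrics quoted above can be illustrated on a list of per-segment topic labels. This is a minimal sketch under stated assumptions, not the paper's evaluation code: it assumes noise is the share of segments given BERTopic's outlier label (-1), transition rate is the share of adjacent segment pairs whose label changes, and normalized entropy is the Shannon entropy of the topic distribution divided by log(K) for K topics. The function names and the example sequence are hypothetical.

```python
import math
from collections import Counter

NOISE_LABEL = -1  # BERTopic assigns -1 to outlier segments


def noise_rate(labels):
    """Fraction of segments assigned to the outlier topic (assumed definition)."""
    return sum(1 for t in labels if t == NOISE_LABEL) / len(labels)


def transition_rate(labels):
    """Fraction of adjacent segment pairs whose topic label differs (assumed definition)."""
    if len(labels) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / (len(labels) - 1)


def normalized_entropy(labels):
    """Shannon entropy of the non-noise topic distribution, divided by log(K)."""
    counts = Counter(t for t in labels if t != NOISE_LABEL)
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # entropy is not informative with fewer than two topics
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(k)


# Hypothetical label sequence for ten consecutive segments of one broadcast.
labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2]
print(noise_rate(labels), transition_rate(labels), normalized_entropy(labels))
```

Under these definitions, lower noise and transition rate mean fewer unassigned segments and fewer topic switches between neighbours, while a normalized entropy near 1 means segments are spread evenly across topics.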
Key takeaway
For Machine Learning Engineers developing video analysis systems, MMTM offers a modular approach to topic modeling. Its tri-modal fusion lowers noise (0.27 to 0.06) and transition rate (0.70 to 0.21), giving more temporally stable topics; the lexical coherence gain held on the German corpus but did not transfer to the shorter NBC broadcasts. Consider this similarity-gated pipeline for long-form video topic discovery, especially on broadcast news, and check coherence on your own corpus. The released code and corpus support implementation and evaluation.
Key insights
MMTM's tri-modal fusion significantly enhances long-form video topic discovery by integrating speech, audio, and visual data.
Principles
- Joint tri-modal modeling improves topic quality.
- Similarity-gated fusion enhances topic coherence.
- Corpus characteristics affect lexical coherence.
Method
MMTM integrates speech recognition, audio/visual embeddings, and BERTopic clustering via deterministic similarity-gated fusion for topic discovery in long-form video.
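The summary does not spell out the gating rule, so the following is only one way a deterministic similarity gate could look, not the paper's formula. It assumes each segment's text, audio, and visual embeddings have already been projected into a shared space of equal dimension, and that an auxiliary modality is mixed in only when its cosine similarity to the text embedding reaches a threshold. The function `gated_fusion`, the `threshold` value, and the similarity weighting are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(text_vec, audio_vec, visual_vec, threshold=0.3):
    """Fuse one segment's embeddings with a hypothetical similarity gate.

    An auxiliary modality contributes only if it agrees with the text
    embedding (cosine >= threshold); its weight is that similarity.
    No randomness or learned parameters are involved.
    """
    fused = l2_normalize(text_vec)
    for aux in (audio_vec, visual_vec):
        sim = cosine(text_vec, aux)
        if sim >= threshold:  # gate open: modality agrees with the transcript
            aux_unit = l2_normalize(aux)
            fused = [f + sim * a for f, a in zip(fused, aux_unit)]
    return l2_normalize(fused)


# Toy 3-d example: audio agrees with the text, visual is orthogonal to it.
segment = gated_fusion([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0])
print(segment)
```

In a pipeline like the one described, the fused vectors would then be passed to the clustering stage in place of text-only embeddings. The gate keeps a noisy or off-topic modality from pulling a segment away from its transcript.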
In practice
- Apply MMTM for coherent topic discovery in long-form video.
- Utilize the released 54-hour video topic corpus.
- Consider corpus length when evaluating lexical coherence.
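To make the last point concrete, here is a small sketch of NPMI coherence computed from document-level co-occurrence. It uses the standard definition, NPMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))) / (-log p(w1, w2)), averaged over pairs of a topic's top words. The function name and toy documents are hypothetical, and the paper's exact reference corpus and windowing are not given in this summary. Because the probabilities are estimated from document counts, fewer or shorter documents give sparser estimates. That may be one reason scores vary by corpus, though the summary does not state the cause.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top_words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def prob(*words):
        # Share of documents containing every given word.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: NPMI lower bound
        elif p12 == 1:
            scores.append(1.0)   # limit case: words co-occur in every document
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized segments standing in for transcript windows.
docs = [
    ["election", "vote", "party"],
    ["election", "vote"],
    ["weather", "rain"],
    ["weather", "storm", "rain"],
]
print(npmi_coherence(["election", "vote"], docs))
print(npmi_coherence(["election", "rain"], docs))
```

Scores range from -1 (the words never co-occur) to 1 (they always co-occur), so comparing values across corpora of different sizes should be done with care.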
Topics
- MMTM
- Topic Modeling
- Long-Form Video
- Multimodal Fusion
- BERTopic
- Video Analysis
Best for: Research Scientist, NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.