MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion

2026-05-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MMTM is a modular pipeline for topic discovery in long-form video. It combines speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion step. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, the transition rate falls from 0.70 to 0.21, and normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves 5- to 12-fold across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on the German corpus, but the gain is corpus-dependent and does not transfer to the shorter NBC broadcasts. The pipeline code and a human-validated 54-hour multimodal video topic corpus are released.
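
The three sequence-level figures in the summary (noise, transition rate, normalized entropy) can be illustrated with a minimal sketch. The summary does not give the paper's exact formulas, so the definitions here are assumptions: noise is taken as the share of segments assigned to BERTopic's outlier topic (-1), transition rate as the share of adjacent segment pairs whose topic differs, and normalized entropy as Shannon entropy of the topic distribution divided by log K. The function names are illustrative only.

```python
import math
from collections import Counter


def noise_fraction(labels, noise_label=-1):
    """Share of segments assigned to the outlier topic (assumed definition)."""
    return sum(1 for t in labels if t == noise_label) / len(labels)


def transition_rate(labels):
    """Share of adjacent segment pairs whose topic label differs (assumed definition)."""
    if len(labels) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / (len(labels) - 1)


def normalized_entropy(labels, noise_label=-1):
    """Shannon entropy over non-outlier topics, divided by log(K) (assumed definition)."""
    counts = Counter(t for t in labels if t != noise_label)
    k = len(counts)
    if k < 2:
        return 0.0  # a single topic carries no entropy
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


# Hypothetical per-segment topic labels for one broadcast, in temporal order.
labels = [0, 0, 0, 1, 1, -1, 2, 2, 2, 2]
print(noise_fraction(labels), transition_rate(labels), normalized_entropy(labels))
```

Under these definitions, a lower transition rate means fewer topic switches between neighboring segments, which is one way to read "temporally stable".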

Key takeaway

For machine learning engineers building video analysis systems, MMTM is a tri-modal approach to topic modeling worth evaluating. In the reported evaluation, fusing speech, audio, and visual signals cut noise from 0.27 to 0.06 and the transition rate from 0.70 to 0.21, yielding more coherent and temporally stable topics. Consider the similarity-gated pipeline for long-form video topic discovery, especially on broadcast news, but check lexical coherence on your own corpus, since the NPMI gain on German did not transfer to the shorter NBC broadcasts. The released code and corpus give a starting point for implementation and evaluation.

Key insights

Fusing speech, audio, and visual signals improves long-form video topic discovery: in the reported evaluation, topics are less noisy and more temporally stable.

Principles

Joint tri-modal modeling improves topic quality.
Similarity-gated fusion enhances topic coherence.
Corpus characteristics affect lexical coherence.

Method

MMTM integrates speech recognition, audio/visual embeddings, and BERTopic clustering via deterministic similarity-gated fusion for topic discovery in long-form video.
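
The summary does not say how the similarity gate is computed, so the following is only a sketch of one deterministic gate, not the authors' method. It assumes the three per-segment embeddings have already been projected to a shared dimension, treats the text embedding as the anchor, and keeps an audio or visual embedding only when its cosine similarity to the anchor reaches a fixed threshold. The function names, the hard threshold, and the averaging rule are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def gated_fusion(text_vec, audio_vec, visual_vec, threshold=0.5):
    """Average the text embedding with each modality that passes the gate.

    Hypothetical rule: a modality is kept only if its cosine similarity
    to the text embedding is at least `threshold`. No randomness and no
    learned weights are involved, so the result is deterministic.
    """
    fused = list(text_vec)
    kept = ["text"]
    for name, vec in (("audio", audio_vec), ("visual", visual_vec)):
        if cosine(text_vec, vec) >= threshold:
            fused = [f + x for f, x in zip(fused, vec)]
            kept.append(name)
    fused = [f / len(kept) for f in fused]
    return fused, kept


# Toy 2-D embeddings: audio agrees with the transcript, visual does not.
fused, kept = gated_fusion([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
print(kept, fused)
```

In a pipeline of this shape, the fused vectors would be passed to BERTopic as precomputed embeddings in place of text-only ones; whether MMTM does exactly this is not stated in the summary.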

In practice

Apply MMTM for coherent topic discovery in long-form video.
Utilize the released 54-hour video topic corpus.
Consider broadcast length and corpus characteristics when evaluating lexical coherence (NPMI).
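
To see why lexical coherence can depend on the corpus, it helps to look at how NPMI is estimated. The sketch below uses the standard definition, NPMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))) / (-log p(w1, w2)), with probabilities taken as document-level co-occurrence frequencies and averaged over all pairs of a topic's top words. The function name, the toy documents, and the handling of edge cases are assumptions, not the paper's evaluation code.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of `top_words`, using document co-occurrence."""
    docs = [set(d) for d in documents]
    n = len(docs)
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = sum(1 for d in docs if w1 in d) / n
        p2 = sum(1 for d in docs if w2 in d) / n
        p12 = sum(1 for d in docs if w1 in d and w2 in d) / n
        if p12 == 0.0:
            scores.append(-1.0)  # words never co-occur: conventional lower bound
        elif p12 == 1.0:
            scores.append(0.0)  # both words in every document: NPMI undefined, treated as 0 here
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


# Toy tokenized documents (hypothetical, not from the released corpus).
documents = [
    ["tax", "budget", "vote"],
    ["tax", "budget"],
    ["storm", "flood"],
    ["storm", "flood", "rescue"],
]
print(npmi_coherence(["tax", "budget"], documents))
print(npmi_coherence(["tax", "storm"], documents))
```

Because every probability is a count divided by the number of documents, a small or short corpus gives sparse co-occurrence counts and noisier scores, which may be one reason a coherence gain on one corpus does not carry over to another.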

Topics

MMTM
Topic Modeling
Long-Form Video
Multimodal Fusion
BERTopic
Video Analysis

Best for: Research Scientist, NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.