A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Summary
A new Mixture-of-Experts (MoE) framework, Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), addresses multimodal emotion recognition in conversations (ERC). MiSTER-E decouples modality-specific context modeling from multimodal fusion. It uses large language models (LLMs) fine-tuned for speech and text to generate utterance-level embeddings, which are then passed through a convolutional-recurrent context modeling layer. A learned gating mechanism combines the predictions of three specialized experts: speech-only, text-only, and cross-modal. To encourage consistency, MiSTER-E adds a supervised contrastive loss on the speech and text representations and a KL-divergence-based regularizer across expert predictions. Notably, the model does not rely on speaker identity. Evaluated on the IEMOCAP, MELD, and MOSI datasets, MiSTER-E achieved weighted F1-scores of 70.9%, 69.5%, and 87.9% respectively, surpassing several baseline speech-text ERC systems.
Key takeaway
For research scientists developing multimodal emotion recognition systems, MiSTER-E offers a framework that does not rely on speaker identity. Consider its modular MoE architecture and its use of a supervised contrastive loss to improve cross-modal alignment. This approach could improve the accuracy and generalizability of your ERC models, particularly where explicit speaker information is unavailable or undesirable.
Key insights
MiSTER-E uses a Mixture-of-Experts framework to fuse speech and text for robust emotion recognition in conversations.
Principles
- Decouple modality context from fusion
- Integrate expert predictions dynamically
- Enforce cross-modal consistency
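The third principle can be illustrated with a small consistency penalty. The summary only says that the regularizer is KL-divergence-based, so the exact form below (a symmetric KL averaged over all expert pairs) is an assumption made for illustration, and the names `kl_divergence` and `consistency_penalty` are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_penalty(expert_probs):
    """Average symmetric KL over every pair of expert distributions.

    Assumption: the source does not specify the pairing or direction of the
    KL term; a symmetric pairwise form is used here as one plausible choice.
    """
    pairs = list(combinations(expert_probs, 2))
    total = sum(kl_divergence(p, q) + kl_divergence(q, p) for p, q in pairs)
    return total / len(pairs)


# Hypothetical class probabilities from the three experts for one utterance.
speech_probs = [0.70, 0.10, 0.10, 0.10]
text_probs = [0.60, 0.20, 0.10, 0.10]
cross_probs = [0.65, 0.15, 0.10, 0.10]

penalty = consistency_penalty([speech_probs, text_probs, cross_probs])
```

The penalty is zero when all experts agree and grows as their predictions diverge, so adding it to the training loss nudges the experts toward compatible outputs.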
Method
MiSTER-E uses LLM-derived speech/text embeddings, enhances them with a convolutional-recurrent layer, and fuses three expert predictions (speech, text, cross-modal) via a learned gating mechanism, regularized by contrastive and KL-divergence losses.
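As a rough sketch of the gated fusion step, the snippet below mixes three expert predictions with softmax gate weights. It assumes the gate combines class probabilities (the summary does not say whether probabilities or logits are mixed), and it uses fixed gate scores as a stand-in for the output of the learned gating network. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_logits, gate_scores):
    """Mix per-expert class distributions using softmax gate weights.

    expert_logits: one list of class logits per expert
                   (speech-only, text-only, cross-modal).
    gate_scores:   one unnormalized score per expert; in a trained model
                   these would come from a learned gating network.
    """
    gate_weights = softmax(gate_scores)
    expert_probs = [softmax(logits) for logits in expert_logits]
    num_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(gate_weights, expert_probs))
        for c in range(num_classes)
    ]


# Hypothetical logits for one utterance over four emotion classes.
speech_logits = [2.0, 0.1, -1.0, 0.3]
text_logits = [0.2, 1.5, 0.0, -0.5]
cross_logits = [1.0, 1.2, -0.3, 0.0]
gate_scores = [0.5, 1.0, 1.5]  # stand-in for the gating network's output

fused = gated_fusion([speech_logits, text_logits, cross_logits], gate_scores)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the gate weights sum to one, the fused output is itself a valid probability distribution, and the gate can lean on whichever expert is more reliable for a given utterance.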
In practice
- Apply LLMs for utterance embeddings
- Use MoE for multimodal fusion
- Implement contrastive loss for alignment
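For the contrastive alignment item, here is a minimal sketch of a supervised contrastive loss in its commonly used form: each anchor is pulled toward other samples that share its label and pushed away from the rest. It assumes speech and text embeddings are pooled into one batch with their emotion labels; whether MiSTER-E forms its pairs this way is not stated in the summary. The function names and the temperature value are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of labeled embeddings.

    Anchors with no same-label partner in the batch are skipped.
    """
    feats = [l2_normalize(v) for v in embeddings]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all non-anchor samples, computed stably.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical 2-D embeddings: two speech and two text utterances.
speech_embeddings = [[1.0, 0.0], [0.0, 1.0]]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
batch = speech_embeddings + text_embeddings

aligned_loss = supervised_contrastive_loss(batch, labels=[0, 1, 0, 1])
misaligned_loss = supervised_contrastive_loss(batch, labels=[0, 1, 1, 0])
```

In this toy batch the loss is small when same-emotion speech and text embeddings coincide and large when they point in different directions, which is the alignment pressure the loss is meant to supply.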
Topics
- Multimodal Emotion Recognition
- Mixture-of-Experts
- Large Language Models
- Conversational AI
- Speech-Text Fusion
Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.