A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

2026-02-26 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E) is a new Mixture-of-Experts (MoE) framework for multimodal emotion recognition in conversations (ERC). It decouples modality-specific context modeling from multimodal fusion: large language models (LLMs) fine-tuned for speech and for text produce utterance-level embeddings, and a convolutional-recurrent context modeling layer then processes those embeddings. A learned gating mechanism combines the predictions of three specialized experts (speech-only, text-only, and cross-modal). To encourage consistency, MiSTER-E adds a supervised contrastive loss on the speech and text representations and a KL-divergence-based regularizer across expert predictions. The model does not rely on speaker identity. On the IEMOCAP, MELD, and MOSI datasets, MiSTER-E achieved weighted F1-scores of 70.9%, 69.5%, and 87.9% respectively, surpassing several baseline speech-text ERC systems.

Key takeaway

For research scientists developing multimodal emotion recognition systems, MiSTER-E offers a framework that does not depend on speaker identity. Consider its modular MoE architecture and its use of a supervised contrastive loss to align speech and text representations. This approach could improve the accuracy and generalizability of your ERC models, particularly where explicit speaker information is unavailable or undesirable.

Key insights

MiSTER-E uses a Mixture-of-Experts framework to fuse speech and text for robust emotion recognition in conversations.

Principles

Decouple modality context from fusion
Integrate expert predictions dynamically
Enforce cross-modal consistency
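
One way to picture the consistency principle is as a penalty that grows when the three experts disagree. The summary only states that the regularizer is KL-divergence-based, so the exact form below is an assumption: it averages the KL divergence from each expert's class distribution to the mean distribution (a generalized Jensen-Shannon divergence). The names `kl_divergence` and `expert_consistency_penalty` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def expert_consistency_penalty(expert_probs):
    """Average KL from each expert's distribution to the mean distribution.

    Assumed form, not taken from the paper: it is zero when all experts
    agree and increases as their predictions diverge.
    """
    n_experts = len(expert_probs)
    n_classes = len(expert_probs[0])
    mean = [sum(p[c] for p in expert_probs) / n_experts for c in range(n_classes)]
    return sum(kl_divergence(p, mean) for p in expert_probs) / n_experts


# Example: speech-only, text-only, and cross-modal experts over three emotion classes.
speech = [0.7, 0.2, 0.1]
text = [0.6, 0.3, 0.1]
cross_modal = [0.2, 0.2, 0.6]
penalty = expert_consistency_penalty([speech, text, cross_modal])
```

In training, a term like this would typically be added to the classification loss with a small weight; that weight is a tuning choice not given in the summary.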

Method

MiSTER-E uses LLM-derived speech/text embeddings, enhances them with a convolutional-recurrent layer, and fuses three expert predictions (speech, text, cross-modal) via a learned gating mechanism, regularized by contrastive and KL-divergence losses.
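
The gating step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate scores are passed in directly (in a trained model they would come from a learned network), and the fusion mixes class probabilities rather than logits, which the summary does not specify. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_logits, gate_logits):
    """Mix per-expert class distributions with softmax-normalized gate weights.

    expert_logits: one list of class logits per expert
                   (speech-only, text-only, cross-modal).
    gate_logits:   one score per expert; assumed here to be supplied by a
                   learned gating network that is outside this sketch.
    """
    weights = softmax(gate_logits)
    expert_probs = [softmax(logits) for logits in expert_logits]
    n_classes = len(expert_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]
    return fused, weights


# Example: the gate favors the cross-modal expert for this utterance.
fused_probs, gate_weights = gated_fusion(
    expert_logits=[[1.0, 0.2, -0.5], [0.3, 0.9, -0.1], [-0.2, 0.1, 1.5]],
    gate_logits=[0.1, 0.2, 1.0],
)
```

Because the gate weights sum to one and each expert output is a distribution, the fused output is itself a valid distribution over emotion classes.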

In practice

Apply LLMs for utterance embeddings
Use MoE for multimodal fusion
Implement contrastive loss for alignment
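
As a starting point for the alignment step, here is a small pure-Python sketch of a supervised contrastive loss in the commonly used form (mean log-probability of same-label positives per anchor). How MiSTER-E batches speech and text embeddings is not described in the summary; the example assumes both modalities are pooled into one batch and labeled by emotion class, and `supervised_contrastive_loss` and the `temperature` default are illustrative choices.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-label embeddings together and push different-label ones apart."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a positive contribute nothing
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Example: two utterances, each with a speech and a text embedding (toy 2-D vectors).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
emotion_labels = [0, 0, 1, 1]  # speech and text of the same utterance share a label
loss = supervised_contrastive_loss(batch, emotion_labels)
```

The loss is low when embeddings with the same emotion label already sit close together, which is the behavior a cross-modal alignment objective is meant to reward.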

Topics

Multimodal Emotion Recognition
Mixture-of-Experts
Large Language Models
Conversational AI
Speech-Text Fusion

Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.