Multimodal Hidden Markov Models for Persistent Emotional State Tracking
Summary
A new framework tracks persistent emotional states in conversations with multimodal Hidden Markov Models (HMMs), with the aim of improving understanding and guidance in clinical contexts. This lightweight approach models conversational emotion as a sequence of latent emotional regimes, using sticky factorial hierarchical Dirichlet process HMMs (HDP-HMMs) over multimodal valence-arousal (VA) representations derived from simultaneous video, audio, and textual input. Compared with baseline Gaussian HMMs, the sticky HDP-HMM produces more interpretable regime sequences: it reduces single-utterance regimes by nearly an order of magnitude and regime shifts by a factor of four, and it yields longer mean regime durations. The framework is also computationally efficient relative to LLM-based dialogue state tracking methods. Furthermore, augmenting LLM responses in a clinical question-answer setting with these interpretable emotional phases improves response quality, particularly in unstable affective regimes, without relying on expensive runtime LLM inference.
Key takeaway
Research scientists developing conversational AI for clinical or other sensitive applications should consider integrating sticky factorial HDP-HMMs for emotional state tracking. This approach provides more stable and interpretable emotional regimes than traditional HMMs, especially on multimodal data. Augmenting the LLM's context with the inferred regimes can significantly improve response quality in emotionally unstable conversational segments, leading to more attuned and clinically appropriate interactions without high computational cost.
Key insights
Sticky factorial HDP-HMMs effectively track persistent emotional regimes in multimodal conversations, outperforming standard HMMs and enhancing LLM responses.
Principles
- Emotional states in conversations are persistent, not fleeting.
- Multimodal input improves robustness of emotion recognition.
- Temporal regularization is crucial for meaningful regime detection.
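
The regime statistics named in the summary (single-utterance regimes, regime shifts, mean regime duration) can be computed from any decoded state sequence. The following is a minimal sketch, assuming one integer state label per utterance; the function name `regime_stats` and the two toy sequences are illustrative and do not come from the original article.

```python
from itertools import groupby

def regime_stats(states):
    """Summarize a decoded regime sequence (one state label per utterance)."""
    # Collapse consecutive identical labels into run lengths.
    lengths = [sum(1 for _ in group) for _, group in groupby(states)]
    if not lengths:
        return {"num_shifts": 0, "single_utterance_regimes": 0, "mean_duration": 0.0}
    return {
        # A shift happens at every boundary between two runs.
        "num_shifts": len(lengths) - 1,
        # Regimes that last exactly one utterance.
        "single_utterance_regimes": sum(1 for n in lengths if n == 1),
        # Average number of utterances per regime.
        "mean_duration": sum(lengths) / len(lengths),
    }

# Toy sequences (made up for illustration, not data from the article).
jittery = [0, 1, 0, 0, 2, 0, 1, 1]
smooth = [0, 0, 0, 0, 1, 1, 1, 1]

print(regime_stats(jittery))
print(regime_stats(smooth))
```

On these toy inputs the jittery sequence has five shifts and four single-utterance regimes, while the smooth one has one shift and none, which is the kind of contrast the reported comparison describes.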
Method
The method fits a truncated sticky factorial HDP-HMM to multimodal valence-arousal representations extracted from text (DistilBERT), audio (Wav2Vec 2.0), and video (EmoNet). Its emission model factorizes across modalities, treating them as conditionally independent given the latent state, and the effective number of states is inferred rather than fixed in advance.
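
To make the two ingredients above concrete, here is a simplified sketch of a factorized emission model combined with a sticky transition bias. It is not the article's inference procedure: it swaps Bayesian HDP inference for plain Viterbi decoding with hand-set parameters and a fixed truncation of two states. The sticky rows use the prior-mean form (alpha / K + kappa * [i == j]) / (alpha + kappa), which assumes uniform base weights. All names, means, variances, and observations are hypothetical toy values.

```python
import math

MODALITIES = ("text", "audio", "video")

def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def factorized_loglik(obs, state):
    """Sum per-modality log-likelihoods: modalities are treated as
    conditionally independent given the latent regime."""
    total = 0.0
    for m in MODALITIES:
        v, a = obs[m]
        mu_v, mu_a = state[m]["mean"]
        var = state[m]["var"]
        total += gaussian_logpdf(v, mu_v, var) + gaussian_logpdf(a, mu_a, var)
    return total

def sticky_transitions(num_states, alpha, kappa):
    """Prior-mean transition rows with extra mass kappa on self-transitions
    (assumes uniform base weights; kappa = 0 gives a uniform matrix)."""
    return [
        [(alpha / num_states + (kappa if i == j else 0.0)) / (alpha + kappa)
         for j in range(num_states)]
        for i in range(num_states)
    ]

def viterbi(observations, states, trans):
    """Most likely regime path under a uniform initial distribution."""
    k = len(states)
    log_trans = [[math.log(p) for p in row] for row in trans]
    score = [-math.log(k) + factorized_loglik(observations[0], s) for s in states]
    back = []
    for obs in observations[1:]:
        new_score, ptr = [], []
        for j in range(k):
            best = max(range(k), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best)
            new_score.append(
                score[best] + log_trans[best][j] + factorized_loglik(obs, states[j])
            )
        score = new_score
        back.append(ptr)
    # Backtrace from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def make_state(mean_va, var=0.25):
    """Same (valence, arousal) mean and variance for every modality (toy choice)."""
    return {m: {"mean": mean_va, "var": var} for m in MODALITIES}

# Hypothetical regimes: 0 = calm, 1 = distressed.
states = [make_state((0.5, -0.5)), make_state((-0.5, 0.5))]

calm = {m: (0.5, -0.5) for m in MODALITIES}
blip = {"text": (-0.3, 0.1), "audio": (0.0, 0.1), "video": (0.0, 0.2)}
observations = [calm, calm, blip, calm, calm]

plain = viterbi(observations, states, sticky_transitions(2, alpha=1.0, kappa=0.0))
sticky = viterbi(observations, states, sticky_transitions(2, alpha=1.0, kappa=9.0))
print(plain)   # [0, 0, 1, 0, 0]
print(sticky)  # [0, 0, 0, 0, 0]
```

With no self-transition bias, the single ambiguous utterance produces a one-utterance regime; with the bias, the cost of switching away and back outweighs the small likelihood gain, so the path stays in the calm regime.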
In practice
- Use sticky HDP-HMMs for robust emotional state tracking.
- Augment LLM context with emotional regime summaries.
- Combine text, audio, and video for comprehensive affect analysis.
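
One way to augment LLM context with regime summaries is to collapse the decoded sequence into a few lines of text and prepend it to the question. The sketch below assumes a simple plain-text format; the article does not specify its prompt format, so the function names, regime labels, and wording here are all hypothetical.

```python
from itertools import groupby

def summarize_regimes(states, labels):
    """Turn a per-utterance state sequence into a short plain-text summary."""
    lines, start = [], 1
    for state, group in groupby(states):
        length = sum(1 for _ in group)
        end = start + length - 1
        lines.append(f"utterances {start}-{end}: {labels[state]}")
        start = end + 1
    return "\n".join(lines)

def build_context(question, states, labels):
    """Prepend the regime summary to the question (hypothetical layout)."""
    summary = summarize_regimes(states, labels)
    return (
        "Emotional regime summary:\n"
        f"{summary}\n\n"
        f"Patient question: {question}"
    )

labels = {0: "calm", 1: "distressed"}  # hypothetical regime names
context = build_context("Will the treatment hurt?", [0, 0, 1, 1, 1], labels)
print(context)
```

Because the summary is built from an already decoded sequence, this step adds only string formatting at query time.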
Topics
- Multimodal Hidden Markov Models
- Emotional State Tracking
- Valence-Arousal Representations
- Sticky HDP-HMMs
- LLM-as-a-Judge Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.