Multimodal Hidden Markov Models for Persistent Emotional State Tracking

2026-05-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A new framework tracks persistent emotional states in conversations with multimodal hidden Markov models (HMMs), with the aim of better understanding and guiding dialogue in clinical contexts. This lightweight approach models conversational emotion as a sequence of latent emotional regimes, using sticky factorial HDP-HMMs over multimodal valence-arousal (VA) representations derived from simultaneous video, audio, and textual input. The sticky HDP-HMM significantly outperforms baseline Gaussian HMMs in producing interpretable regime sequences: it reduces single-utterance regimes by nearly an order of magnitude and regime shifts by a factor of four, and it yields longer mean regime durations. The framework is also more computationally efficient than LLM-based dialogue state tracking methods. Furthermore, augmenting LLM responses in a clinical question-answering setting with these interpretable emotional phases improves response quality, particularly in unstable affective regimes, without relying on expensive runtime LLM inference.
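
The three stability measures named above (single-utterance regimes, regime shifts, mean regime duration) can be computed from any decoded state sequence. The following is a minimal sketch, assuming one integer state label per utterance; the function name and the toy sequences are illustrative and do not come from the paper.

```python
from itertools import groupby

def regime_stats(states):
    """Summarize a decoded state sequence (one label per utterance)."""
    # Length of each maximal run of identical consecutive labels.
    runs = [len(list(group)) for _, group in groupby(states)]
    return {
        "num_regimes": len(runs),
        "regime_shifts": max(len(runs) - 1, 0),
        "single_utterance_regimes": sum(1 for r in runs if r == 1),
        "mean_regime_duration": sum(runs) / len(runs) if runs else 0.0,
    }

# Toy sequences (made up for illustration): a jittery decoding vs. a persistent one.
jittery_stats = regime_stats([0, 1, 0, 0, 2, 1, 1, 0])
persistent_stats = regime_stats([0, 0, 0, 0, 1, 1, 1, 1])
print(jittery_stats)
print(persistent_stats)
```

A decoding with fewer shifts and fewer one-utterance regimes is easier to read as a sequence of emotional phases, which is the sense in which the summary calls the sticky model's output more interpretable.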

Key takeaway

Research scientists developing conversational AI for clinical or other sensitive applications should consider integrating sticky factorial HDP-HMMs for emotional state tracking. This approach provides more stable and interpretable emotional regimes than traditional HMMs, especially when dealing with multimodal data. Augmenting an LLM's context with these inferred regimes can significantly improve response quality in emotionally unstable conversational segments, leading to more attuned and clinically appropriate interactions without incurring high computational costs.

Key insights

Sticky factorial HDP-HMMs effectively track persistent emotional regimes in multimodal conversations, outperforming standard HMMs and enhancing LLM responses.

Principles

Emotional states in conversations are persistent, not fleeting.
Multimodal input improves robustness of emotion recognition.
Temporal regularization is crucial for meaningful regime detection.

Method

The method fits a truncated sticky factorial HDP-HMM to multimodal valence-arousal representations extracted from text (DistilBERT), audio (Wav2Vec 2.0), and video (EmoNet). Its emission model is factorized, treating the modalities as conditionally independent given the latent state, and the effective number of states is inferred from the data rather than fixed in advance.
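
To make the two ingredients concrete, here is a rough sketch, not the authors' implementation. It assumes the common weak-limit (finite truncation) form of the sticky HDP prior, in which a hypothetical parameter kappa adds prior mass to self-transitions, and it assumes diagonal Gaussian emissions over 2-D VA points per modality. The factorial structure and the inference procedure are not covered, and all names and numbers are illustrative.

```python
import numpy as np

def sample_sticky_transitions(num_states, alpha, gamma, kappa, rng):
    """Draw a transition matrix from a truncated (weak-limit) sticky HDP prior."""
    # Global state weights shared by every row of the transition matrix.
    beta = rng.dirichlet(np.full(num_states, gamma / num_states))
    # Row j gets concentration alpha * beta, plus kappa on its own diagonal entry.
    concentration = alpha * beta + kappa * np.eye(num_states)
    concentration = np.maximum(concentration, 1e-6)  # guard against exact zeros
    return np.vstack([rng.dirichlet(row) for row in concentration])

def factorized_log_likelihood(obs, means, variances):
    """Sum per-modality diagonal-Gaussian log densities (conditional independence).

    obs: dict modality -> (T, 2) array of valence-arousal points
    means, variances: dict modality -> (K, 2) arrays, one row per state
    Returns a (T, K) array of log p(x_t | state k).
    """
    total = 0.0
    for name, x in obs.items():
        diff = x[:, None, :] - means[name][None, :, :]   # (T, K, 2)
        var = variances[name][None, :, :]                # (1, K, 2)
        total = total - 0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=-1)
    return total

rng = np.random.default_rng(0)
plain_P = sample_sticky_transitions(6, alpha=1.0, gamma=3.0, kappa=0.0, rng=rng)
sticky_P = sample_sticky_transitions(6, alpha=1.0, gamma=3.0, kappa=50.0, rng=rng)
print("mean self-transition, kappa=0 :", np.diag(plain_P).mean())
print("mean self-transition, kappa=50:", np.diag(sticky_P).mean())

# Toy two-state emission model shared by three modalities (values are made up).
modalities = ["text", "audio", "video"]
means = {m: np.array([[-0.5, -0.5], [0.5, 0.5]]) for m in modalities}
variances = {m: np.full((2, 2), 0.1) for m in modalities}
obs = {m: np.tile([0.5, 0.5], (3, 1)) for m in modalities}
log_lik = factorized_log_likelihood(obs, means, variances)
print(log_lik.argmax(axis=1))
```

In this sketch, a larger kappa pushes prior mass onto the diagonal of the transition matrix, which is one way to encode the assumption that emotional regimes persist across utterances. Summing log densities across modalities is what the conditional-independence assumption amounts to in code.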

In practice

Use sticky HDP-HMMs for robust emotional state tracking.
Augment LLM context with emotional regime summaries.
Combine text, audio, and video for comprehensive affect analysis.
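
One possible way to turn inferred regimes into LLM context is sketched below. The summary wording, the prompt layout, and the function names are assumptions made for illustration; the source does not specify the format the authors use. The sketch only builds a prompt string and does not call any model.

```python
from itertools import groupby

def summarize_regimes(states, valence, arousal):
    """Describe each run of identical states with its span and mean VA values."""
    summaries = []
    start = 0
    for state, group in groupby(states):
        length = len(list(group))
        end = start + length
        mean_v = sum(valence[start:end]) / length
        mean_a = sum(arousal[start:end]) / length
        summaries.append(
            f"utterances {start + 1}-{end}: regime {state}, "
            f"mean valence {mean_v:+.2f}, mean arousal {mean_a:+.2f}"
        )
        start = end
    return summaries

def build_prompt(question, summaries):
    """Prepend the regime summaries to a question (hypothetical prompt layout)."""
    lines = ["Emotional regimes inferred so far:"]
    lines += [f"- {s}" for s in summaries]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

# Toy inputs (made up): per-utterance state labels and fused VA values.
summaries = summarize_regimes([0, 0, 1], [0.2, 0.4, -0.6], [0.1, 0.3, 0.8])
prompt = build_prompt("How should the clinician respond next?", summaries)
print(prompt)
```

Because the regime sequence is computed ahead of time by the HMM, this kind of augmentation adds only a few lines of text to the prompt and no extra LLM call.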

Topics

Multimodal Hidden Markov Models
Emotional State Tracking
Valence-Arousal Representations
Sticky HDP-HMMs
LLM-as-a-Judge Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.