Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

2026-06-12 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study evaluated three deep learning architectures, Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCN), and Transformers, for multimodal emotion recognition from physiological signals. Using the WESAD dataset, the research assessed affect recognition from wrist and chest sensor data. Ablation studies measured the contribution of each modality by training models on wrist-only and chest-only inputs. The work also explored two fusion strategies: early fusion, which concatenates the sensor signals, and a late-fusion ensemble, which combines the predictions of all three architectures. Transformer models achieved the highest accuracy in multimodal settings, while TCNs excelled in wrist-only configurations. The late-fusion ensemble yielded the highest overall accuracy, 98.91 ± 0.13%, with a macro-F1 score of 98.56 ± 0.17%, demonstrating the efficacy of fusion techniques.
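As a reference for reading the reported figures, the following is a minimal sketch of how macro-F1 and a "mean ± spread" summary are commonly computed. It assumes the ± values denote a sample standard deviation over repeated runs, which the summary does not state; the labels and run scores below are illustrative placeholders, not values from the study.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_per_class = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_per_class))

# Placeholder labels for a three-class problem (hypothetical data).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
score = macro_f1(y_true, y_pred, n_classes=3)

# Hypothetical scores from repeated runs, summarized as mean and
# sample standard deviation (assumed meaning of the reported ± values).
run_scores = np.array([0.98, 0.99, 0.985])
mean_score = float(run_scores.mean())
std_score = float(run_scores.std(ddof=1))
print(f"macro-F1 on placeholder data: {score:.4f}")
print(f"summary over runs: {100 * mean_score:.2f} ± {100 * std_score:.2f}%")
```

Macro-F1 weights every class equally, so it is less forgiving than accuracy when classes are imbalanced.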

Key takeaway

Machine learning engineers building physiological emotion recognition systems should prioritize ensemble-based fusion, specifically late fusion, which gave the highest accuracy in this study. For multimodal wrist and chest sensor data, consider Transformer models. For resource-constrained or wrist-only deployments, TCNs offer a strong alternative. Combining these fusion and model choices can improve the accuracy of affective computing systems.

Key insights

Combining deep temporal models with sensor-level and ensemble fusion improves physiological emotion recognition accuracy, with the late-fusion ensemble reaching 98.91 ± 0.13% on WESAD.

Principles

Transformer models achieved the highest accuracy in multimodal (wrist and chest) settings.
TCNs show strong performance on single-modality (wrist-only) physiological data.
Fusion strategies, both early and late, improve emotion recognition performance.

Method

The study evaluated LSTM, TCN, and Transformer models on WESAD using wrist/chest signals. It performed ablation studies, early fusion via concatenation, and a late-fusion ensemble of model predictions.
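The two fusion strategies described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the study's implementation: the summary does not say how the ensemble combines predictions, so averaging class probabilities is assumed here, and the function names, channel counts, window sizes, and probability values are all hypothetical. It also assumes wrist and chest signals have already been resampled to windows of equal length.

```python
import numpy as np

def early_fusion(wrist, chest):
    """Concatenate wrist and chest windows along the channel axis.

    Both inputs have shape (n_windows, n_steps, n_channels) and must
    share the first two dimensions.
    """
    if wrist.shape[:2] != chest.shape[:2]:
        raise ValueError("wrist and chest windows must be time-aligned")
    return np.concatenate([wrist, chest], axis=-1)

def late_fusion(prob_list):
    """Average per-model class probabilities (assumed combination rule).

    Each array in prob_list has shape (n_windows, n_classes).
    Returns predicted labels and the averaged probabilities.
    """
    stacked = np.stack(prob_list, axis=0)   # (n_models, n_windows, n_classes)
    mean_probs = stacked.mean(axis=0)
    return mean_probs.argmax(axis=1), mean_probs

# Early fusion on random placeholder windows (hypothetical channel counts).
rng = np.random.default_rng(0)
wrist = rng.normal(size=(4, 32, 6))
chest = rng.normal(size=(4, 32, 8))
fused = early_fusion(wrist, chest)          # one input tensor for a single model

# Late fusion on placeholder probabilities standing in for the outputs
# of trained LSTM, TCN, and Transformer classifiers.
lstm_probs = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
tcn_probs = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
transformer_probs = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])
labels, mean_probs = late_fusion([lstm_probs, tcn_probs, transformer_probs])
print(fused.shape, labels)
```

Early fusion lets one model learn cross-sensor interactions directly from the combined input, while late fusion keeps the models independent and merges only their outputs. In the second placeholder window the LSTM alone would pick a different class than the averaged ensemble, which shows how the combination can override a single model.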

In practice

Implement late-fusion ensembles when peak recognition accuracy is the priority.
Consider TCNs for wrist-only physiological signal analysis.
Integrate multimodal sensor data for robust affect recognition systems.

Topics

Multimodal Emotion Recognition
Physiological Signals
Deep Learning Models
Ensemble Fusion
Temporal Convolutional Networks
Transformers
WESAD Dataset

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.