Dynamic Fusion-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition in Conversations
Summary
The paper introduces a Dynamic Fusion-aware Graph Convolutional Neural Network (DF-GCN) for multimodal emotion recognition in conversations (MERC). DF-GCN addresses a limitation of existing GCN-based methods: they use fixed parameters to fuse multimodal features across all emotion types, which often compromises performance on specific emotions. The proposed model integrates ordinary differential equations (ODEs) into GCNs to capture dynamic emotional dependencies within utterance interaction networks. It also uses prompts generated from a global information vector (GIV) to guide the dynamic fusion of multimodal features, so that parameters can adapt to different emotion categories during inference. Experiments on the IEMOCAP and MELD datasets show that DF-GCN outperforms existing mainstream methods, particularly in weighted accuracy (WA) and weighted F1 (WF1) scores, while maintaining comparable computational efficiency.
Key takeaway
Research scientists developing multimodal emotion recognition systems should consider integrating dynamic fusion mechanisms, such as those in DF-GCN, to overcome the limitations of static-parameter models. Allowing network parameters to adapt to different emotion categories during inference can make emotion classification more flexible and accurate, improving performance on challenging datasets like IEMOCAP and MELD, especially for minority emotion classes.
Key insights
Dynamic fusion of multimodal features via ODE-integrated GCNs improves conversational emotion recognition.
Principles
- Emotional states evolve continuously, not discretely.
- Global context guides adaptive multimodal feature fusion.
- Dynamic parameters enhance model generalization.
Method
DF-GCN combines a Static Graph Convolution (SGCODE) block and a Dynamic Graph Convolution (DGCODE) block, both built on ODEs. It computes a Global Information Vector (GIV) with a Transformer followed by global average pooling, then feeds the GIV to a Prompt Generation Network (PGN) that produces the dynamic weights for DGCODE's adaptive fusion.
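The pipeline described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes a simple graph diffusion ODE, dH/dt = ÂH − H, integrated with fixed-step Euler; a GIV computed by plain average pooling (the Transformer is omitted); and a hypothetical linear-plus-softmax stand-in for the PGN that yields one fusion weight per modality. All function names, the projection matrix `proj`, and the toy features are assumptions made for illustration.

```python
import math

def row_normalize(adj):
    """Add self-loops, then scale each row so it sums to 1."""
    n = len(adj)
    out = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        out.append([v / total for v in row])
    return out

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def graph_ode_euler(h, adj_norm, steps=10, t_end=1.0):
    """Integrate dH/dt = A_norm @ H - H with fixed-step Euler (assumed dynamics)."""
    dt = t_end / steps
    for _ in range(steps):
        agg = matmul(adj_norm, h)
        h = [[x + dt * (a - x) for x, a in zip(h_row, a_row)]
             for h_row, a_row in zip(h, agg)]
    return h

def global_information_vector(h):
    """Global average pooling over utterance nodes (Transformer omitted)."""
    n = len(h)
    return [sum(col) / n for col in zip(*h)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prompt_weights(giv, proj):
    """Hypothetical PGN stand-in: linear projection of the GIV, then softmax."""
    return softmax([sum(w * g for w, g in zip(row, giv)) for row in proj])

def fuse(modalities, weights):
    """Weighted sum of per-modality node features."""
    n, d = len(modalities[0]), len(modalities[0][0])
    return [[sum(w * m[i][k] for w, m in zip(weights, modalities)) for k in range(d)]
            for i in range(n)]

# Toy conversation: 3 utterances in a chain, 2-dimensional features per modality.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
video = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

a_norm = row_normalize(adj)
evolved = [graph_ode_euler(m, a_norm) for m in (text, audio, video)]
giv = global_information_vector(fuse(evolved, [1.0 / 3.0] * 3))
proj = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # made-up, untrained weights
weights = prompt_weights(giv, proj)
fused = fuse(evolved, weights)
```

Because the fusion weights are computed from the GIV rather than stored as fixed parameters, a different conversation yields different weights at inference time, which is the behavior the dynamic fusion idea relies on. A trained model would learn `proj` and would likely use a more expressive ODE function and solver.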
In practice
- Use RoBERTa (text), OpenSMILE (audio), and DenseNet (visual) for initial feature encoding.
- Employ a Bi-GRU for text context and fully connected (FC) networks for audio and video.
- Construct emotional interaction graphs using cosine similarity.
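As a minimal sketch of the graph-construction step, the snippet below builds a weighted adjacency matrix from pairwise cosine similarity between utterance feature vectors. The `threshold` cutoff and the function names are hypothetical; the summary does not say how the paper sparsifies or weights edges.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero feature vector as unrelated
    return dot / (norm_u * norm_v)

def build_similarity_graph(features, threshold=0.5):
    """Return a symmetric weighted adjacency matrix over utterances.

    An edge is kept when the similarity reaches `threshold`
    (an assumed cutoff, not a value from the paper).
    """
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(features[i], features[j])
            if sim >= threshold:
                adj[i][j] = adj[j][i] = sim
    return adj

# Three toy utterance embeddings; the first and last are orthogonal.
utterances = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
graph = build_similarity_graph(utterances, threshold=0.5)
```

Here the first and third utterances get no edge because their similarity is 0, while each connects to the second with weight 1/√2.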
Topics
- Multimodal Emotion Recognition
- Graph Convolutional Networks
- Dynamic Fusion
- Neural Ordinary Differential Equations
- Prompt Learning
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.