Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods
Summary
Research by Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan, and Ali Shendabadi explores using OpenAI's Whisper, a pre-trained automatic speech recognition (ASR) system, as a feature extractor for Speech Emotion Recognition (SER). The study introduces two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, which reduce the dimensionality of Whisper representations while preserving emotional features. Experiments used the Whisper Tiny and Small models on an English dataset (IEMOCAP) and a Persian dataset (ShEMO). The multi-head QKV architecture improved unweighted accuracy on ShEMO by 2.47%, setting a new state of the art. The findings also indicate that intermediate Whisper encoder layers often perform better than the final layer for SER on Persian data. Together, the results position Whisper as a lightweight alternative to larger models such as HuBERT X-Large and highlight its potential as a feature extractor for SER.
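The headline result is stated in unweighted accuracy, which is easy to confuse with plain accuracy. The sketch below assumes the usual SER convention (unweighted accuracy is the mean of per-class recall) and uses made-up toy labels, not data from the study, to show how the two metrics diverge when classes are imbalanced.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall fraction of correct predictions; frequent classes dominate."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every emotion class counts equally."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: "neutral" is four times as frequent as "anger".
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "anger"]

wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recall 8/8 and 1/2
```

Because emotion datasets are often imbalanced, a model that favours the majority class can score well on plain accuracy while its unweighted accuracy stays low.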
Key takeaway
For AI engineers building Speech Emotion Recognition systems, this research suggests using OpenAI's Whisper as a feature extractor. Experiment with intermediate Whisper encoder layers and add attention-based pooling, particularly QKV Pooling, to improve accuracy while keeping models small. The approach set a new state of the art on the Persian ShEMO dataset.
Key insights
Combining Whisper's pre-trained ASR representations with attention-based pooling improves Speech Emotion Recognition, with a reported 2.47% gain in unweighted accuracy on ShEMO.
Principles
- Intermediate encoder layers can outperform the final layer for SER, as observed on Persian data.
- Attention-based pooling reduces dimensionality while preserving emotional features.
Method
The proposed method uses Whisper representations with Multi-head Attentive Average Pooling or QKV Pooling to reduce dimensionality, then evaluates SER performance on English and Persian datasets.
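To make the pooling step concrete, here is a minimal sketch of multi-head attention-weighted average pooling. The summary does not give the paper's exact formulation, so this is a generic version under stated assumptions: the feature dimension is split evenly across heads, each head scores frames with its own weight vector (`head_weights`, a hypothetical stand-in for learned parameters), and random numbers stand in for Whisper encoder outputs. It uses only the standard library so the arithmetic stays visible; a real model would use a tensor library.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attentive_average_pool(frames, head_weights):
    """Pool a (T x D) frame sequence into a single D-dimensional vector.

    frames: list of T frame vectors, each of length D.
    head_weights: list of H scoring vectors, each of length D // H
                  (hypothetical learned parameters).
    """
    num_heads = len(head_weights)
    dim = len(frames[0])
    assert dim % num_heads == 0, "D must be divisible by the number of heads"
    head_dim = dim // num_heads

    pooled = []
    for h, w in enumerate(head_weights):
        lo, hi = h * head_dim, (h + 1) * head_dim
        chunks = [f[lo:hi] for f in frames]           # this head's slice of each frame
        scores = [sum(wi * xi for wi, xi in zip(w, c)) for c in chunks]
        alphas = softmax(scores)                      # one weight per frame, sums to 1
        for j in range(head_dim):                     # weighted average over time
            pooled.append(sum(a * c[j] for a, c in zip(alphas, chunks)))
    return pooled


random.seed(0)
T, D, H = 50, 8, 2  # toy sizes: frames, feature dimension, heads
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
head_weights = [[random.gauss(0, 1) for _ in range(D // H)] for _ in range(H)]

utterance_vec = attentive_average_pool(frames, head_weights)  # length D, not T x D
```

With all-zero scoring vectors every frame gets the same weight and the result reduces to ordinary mean pooling, which makes a useful sanity check.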
In practice
- Utilize Whisper's intermediate layers for SER tasks.
- Apply QKV Pooling for improved SER accuracy.
- Consider Whisper as a lightweight alternative to larger models.
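As a starting point for the QKV Pooling recommendation, the sketch below shows one head of query-key-value pooling. It is an illustration under assumptions, not the authors' architecture: a single learned query vector attends over projected frames, and `query`, `w_key`, and `w_value` are hypothetical parameters filled with random numbers here. A multi-head variant could run several such heads and concatenate their outputs.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def qkv_pool(frames, query, w_key, w_value):
    """Pool T frame vectors into one vector using a learned query.

    query: length-d vector; w_key and w_value: (d x D) projection matrices.
    All three are hypothetical learned parameters.
    """
    keys = [matvec(w_key, f) for f in frames]
    values = [matvec(w_value, f) for f in frames]
    scale = math.sqrt(len(query))                     # scaled dot-product attention
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    alphas = softmax(scores)                          # attention weight per frame
    d = len(values[0])
    pooled = [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(d)]
    return pooled, alphas


random.seed(1)
T, D, d = 30, 6, 4  # toy sizes: frames, input dimension, pooled dimension
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
query = [random.gauss(0, 1) for _ in range(d)]
w_key = [[random.gauss(0, 1) for _ in range(D)] for _ in range(d)]
w_value = [[random.gauss(0, 1) for _ in range(D)] for _ in range(d)]

pooled, alphas = qkv_pool(frames, query, w_key, w_value)
```

The same function can be applied to the output of any encoder layer, so comparing intermediate and final layers only requires changing which layer supplies `frames`.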
Topics
- Speech Emotion Recognition
- Whisper Model
- Attention Mechanisms
- Pre-trained Models
- Dimensionality Reduction
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.