Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

2026-02-05 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Research by Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan, and Ali Shendabadi explores using OpenAI's Whisper, a pre-trained automatic speech recognition (ASR) system, as a feature extractor for Speech Emotion Recognition (SER). The study introduces two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, which reduce the dimensionality of Whisper representations while preserving emotional features. Experiments were conducted on IEMOCAP (English) and ShEMO (Persian) using the Whisper Tiny and Small models. The multi-head QKV architecture improved unweighted accuracy on ShEMO by 2.47%, setting a new state of the art on that dataset. The findings also indicate that intermediate Whisper encoder layers often perform better than the final layer for SER on Persian data. Together, the results position Whisper as a lightweight alternative to much larger models such as HuBERT X-Large.

Key takeaway

For AI engineers developing Speech Emotion Recognition systems, this research suggests using OpenAI's Whisper as a feature extractor. Experiment with intermediate Whisper encoder layers rather than only the final one, and integrate attention-based pooling, particularly QKV Pooling, to improve accuracy while keeping models small. The gains were clearest on Persian, where the approach set a new state of the art on ShEMO.

Key insights

Whisper's pre-trained ASR representations, combined with attention-based pooling, are effective features for Speech Emotion Recognition, improving unweighted accuracy on ShEMO by 2.47%.

Principles

Intermediate encoder layers can outperform final layers for SER.
Attention-based pooling reduces the dimensionality of Whisper representations while preserving emotional features.

Method

The proposed method uses Whisper representations with Multi-head Attentive Average Pooling or QKV Pooling to reduce dimensionality, then evaluates SER performance on English and Persian datasets.
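
To make the pooling step concrete, here is a minimal NumPy sketch of one plausible reading of multi-head attentive average pooling. It is not the authors' implementation: the function name, the per-head scoring vectors in `head_weights`, and the choice of four heads are assumptions for illustration, and the weights are random rather than trained. The input shape mirrors a Whisper Tiny encoder output for a full 30-second window (1500 frames of width 384).

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attentive_average_pooling(frames, head_weights):
    """Pool (T, D) frame features into a single (D,) utterance vector.

    frames:       (T, D) encoder outputs, one row per time frame.
    head_weights: (H, D // H) hypothetical scoring vector per head;
                  these would be learned in a real model.
    """
    num_frames, dim = frames.shape
    num_heads, head_dim = head_weights.shape
    assert num_heads * head_dim == dim, "D must equal H * head_dim"

    # Split the feature axis into H contiguous chunks: (T, H, d).
    heads = frames.reshape(num_frames, num_heads, head_dim)
    # One scalar relevance score per frame and head: (T, H).
    scores = np.einsum("thd,hd->th", heads, head_weights)
    # Normalise over time so each head's weights sum to 1.
    attn = softmax(scores, axis=0)
    # Attention-weighted average over time for each head: (H, d).
    pooled = np.einsum("th,thd->hd", attn, heads)
    # Concatenate the heads back into one vector: (D,).
    return pooled.reshape(dim)

rng = np.random.default_rng(0)
frames = rng.standard_normal((1500, 384))        # stand-in for encoder output
head_weights = rng.standard_normal((4, 96)) * 0.01
utterance_vector = multihead_attentive_average_pooling(frames, head_weights)
print(utterance_vector.shape)
```

With all scoring vectors set to zero, every frame receives the same weight and the function reduces to plain mean pooling over time, which is a convenient sanity check.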

In practice

Utilize Whisper's intermediate layers for SER tasks.
Apply QKV Pooling for improved SER accuracy.
Consider Whisper as a lightweight alternative to larger models.
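
The recommendations above can be sketched as follows: pick an intermediate encoder layer, then pool it with a query-key-value attention step. This is an illustrative sketch under stated assumptions, not the paper's architecture. The learnable `query` vector, the projection matrices `w_key` and `w_value`, the output width of 128, and `layer_index = 2` are all hypothetical, and the hidden states are random stand-ins. With the Hugging Face `transformers` library, a comparable per-layer stack can be obtained by calling the Whisper encoder with `output_hidden_states=True`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_pooling(frames, query, w_key, w_value, num_heads):
    """Pool (T, D) frames into a (D_out,) vector with multi-head attention.

    frames:  (T, D) features from one encoder layer.
    query:   (D_out,) hypothetical learnable query, split across heads.
    w_key:   (D, D_out) hypothetical key projection.
    w_value: (D, D_out) hypothetical value projection.
    """
    num_frames = frames.shape[0]
    out_dim = w_key.shape[1]
    assert out_dim % num_heads == 0, "D_out must be divisible by num_heads"
    head_dim = out_dim // num_heads

    keys = (frames @ w_key).reshape(num_frames, num_heads, head_dim)
    values = (frames @ w_value).reshape(num_frames, num_heads, head_dim)
    q = query.reshape(num_heads, head_dim)

    # Scaled dot-product score of each head's query against every frame: (H, T).
    scores = np.einsum("hd,thd->ht", q, keys) / np.sqrt(head_dim)
    attn = softmax(scores, axis=1)  # normalise over time
    # Weighted sum of values per head, then concatenate heads: (D_out,).
    pooled = np.einsum("ht,thd->hd", attn, values)
    return pooled.reshape(out_dim)

rng = np.random.default_rng(0)
num_layers, num_frames, dim, out_dim = 4, 1500, 384, 128

# Stand-in for per-layer encoder outputs: index 0 is the embedding output,
# indices 1..num_layers are the transformer layers.
hidden_states = rng.standard_normal((num_layers + 1, num_frames, dim))
layer_index = 2  # hypothetical choice; in practice, select on validation data
frames = hidden_states[layer_index]

query = rng.standard_normal(out_dim)
w_key = rng.standard_normal((dim, out_dim)) / np.sqrt(dim)
w_value = rng.standard_normal((dim, out_dim)) / np.sqrt(dim)

utterance_vector = qkv_pooling(frames, query, w_key, w_value, num_heads=4)
print(utterance_vector.shape)
```

Here the pooled vector is narrower than each input frame (128 versus 384), so a sequence of 1500 frames collapses to a single compact vector that a small classifier could consume. Which layer works best is an empirical question, so `layer_index` is best treated as a hyperparameter.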

Topics

Speech Emotion Recognition
Whisper Model
Attention Mechanisms
Pre-trained Models
Dimensionality Reduction

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.