Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, long

Summary

Researchers at the University of Science and Technology of Oran - Mohamed Boudiaf (USTO-MB) developed a hybrid CNN–Transformer architecture for Arabic Speech Emotion Recognition (SER). Arabic SER remains under-researched, largely because annotated datasets are scarce. The model applies convolutional layers to Mel-spectrograms to extract spectral features, then uses Transformer encoders to capture long-range temporal dependencies. Evaluated on the EYASE (Egyptian Arabic speech emotion) corpus, the system achieved 97.8% accuracy and a macro F1-score of 0.98. These results exceed those of the traditional classifiers and CNN-only baselines it was compared against, which suggests that combining convolutional feature extraction with attention-based modeling is effective for low-resource languages such as Arabic.
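The two reported metrics measure different things: accuracy counts correct predictions overall, while macro F1 averages the per-class F1 scores with equal weight, so a poorly recognized minority emotion pulls it down. The sketch below illustrates the difference on a tiny, made-up label set. The emotion names and predictions are hypothetical and are not taken from EYASE or from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy data: "angry" dominates, "happy" is missed.
y_true = ["angry", "angry", "angry", "angry", "happy", "neutral"]
y_pred = ["angry", "angry", "angry", "angry", "angry", "neutral"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro_f1={f1:.3f}")
```

In this toy case accuracy stays high while macro F1 drops, because the single missed "happy" sample zeroes out that class's F1. When both metrics are close, as in the reported 97.8% and 0.98, performance is roughly even across classes.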

Key takeaway

For research scientists developing speech emotion recognition systems for low-resource languages, this CNN–Transformer hybrid offers a strong benchmark. Consider pairing convolutional layers for local feature extraction with Transformer encoders for global temporal modeling, especially when working with limited datasets. On EYASE the approach outperformed traditional classifiers and CNN-only baselines, and it can guide future work on cross-dialectal generalization.

Key insights

A hybrid CNN–Transformer model significantly improves Arabic Speech Emotion Recognition by combining local spectral and global temporal feature learning.

Method

The method converts audio to Mel-spectrograms, applies CNN layers to extract local spectral features, passes the resulting sequence through Transformer encoders that model global temporal context via self-attention, and ends with a classification layer with softmax activation.
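The pipeline described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the layer counts, channel widths, `d_model`, number of heads, number of emotion classes, and the mean-pooling step are all assumed values, and positional encoding is left out for brevity. The input is assumed to be a precomputed log-Mel spectrogram, with random tensors standing in for real audio features.

```python
import torch
import torch.nn as nn


class CNNTransformerSER(nn.Module):
    """Illustrative CNN-Transformer classifier over log-Mel spectrograms.

    All sizes (channels, d_model, heads, layers, n_classes) are assumptions
    made for this sketch, not the configuration reported in the paper.
    """

    def __init__(self, n_mels=64, n_classes=4, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # CNN front end: learns local spectral patterns. Each pooling stage
        # halves both the frequency axis and the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # After two 2x poolings the frequency axis has n_mels // 4 bins.
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mel):
        # mel: (batch, n_mels, frames)
        x = self.cnn(mel.unsqueeze(1))         # (batch, 64, n_mels/4, frames/4)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (batch, frames/4, 64 * n_mels/4)
        x = self.encoder(self.proj(x))         # self-attention across time steps
        return self.classifier(x.mean(dim=1))  # mean-pool over time, return logits


torch.manual_seed(0)
model = CNNTransformerSER().eval()
dummy_mel = torch.randn(2, 64, 100)  # stand-in for two real log-Mel spectrograms
with torch.no_grad():
    probs = torch.softmax(model(dummy_mel), dim=-1)  # per-class probabilities
print(probs.shape)
```

The model returns logits and the softmax is applied outside it, so the same module can be trained with `nn.CrossEntropyLoss`, which expects unnormalized scores.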

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.