Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

2026-04-10 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, long

Summary

Researchers at the University of Science and Technology of Oran - Mohamed Boudiaf (USTO-MB) developed a hybrid CNN–Transformer architecture for Arabic Speech Emotion Recognition (SER). Arabic SER remains under-researched, largely because annotated datasets are scarce. The model uses convolutional layers to extract spectral features from Mel-spectrograms and Transformer encoders to model long-range temporal dependencies. Evaluated on the EYASE (Egyptian Arabic speech emotion) corpus, it achieved 97.8% accuracy and a macro F1-score of 0.98, outperforming traditional classifiers and CNN-only baselines. The results suggest that combining convolutional feature extraction with attention-based modeling is effective for low-resource languages such as Arabic.
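For readers comparing the two reported metrics: macro F1 is the unweighted mean of per-class F1 scores, so every emotion class counts equally regardless of how many utterances it has. A minimal sketch follows; the label names and predictions are toy values invented for illustration, not EYASE data or the paper's outputs.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, invented for illustration only.
y_true = ["angry", "happy", "sad", "neutral", "sad", "happy"]
y_pred = ["angry", "happy", "sad", "neutral", "happy", "happy"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.867
```

In this toy case accuracy is 5/6 (about 0.833) while macro F1 is 0.867, which shows that the two numbers need not move together.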

Key takeaway

For research scientists developing speech emotion recognition systems for low-resource languages, this CNN–Transformer hybrid architecture offers a strong reference point. Consider pairing convolutional layers for local feature extraction with Transformer encoders for global temporal modeling, especially when working with limited datasets. On EYASE the approach outperforms traditional classifiers and CNN-only baselines, and it can guide future efforts in cross-dialectal generalization.

Key insights

A hybrid CNN–Transformer model significantly improves Arabic Speech Emotion Recognition by combining local spectral and global temporal feature learning.

Principles

Hybrid architectures combine complementary strengths: convolutions for local patterns, attention for global context.
Mel-spectrograms are an effective input representation for deep learning in speech.
Attention mechanisms capture long-range dependencies.
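
To make the last principle concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. Every time frame is scored against every other frame, so a dependency between distant frames costs one step rather than many. The sequence length, feature size, and random projection matrices are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 50, 16  # hypothetical: 50 time frames, 16-dimensional features
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (50, 16) (50, 50)
```

Row i of `weights` shows how much frame i draws on every other frame, including ones far away in time.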

Method

The method has four stages: audio is preprocessed into Mel-spectrograms, CNN layers extract local spectral features, Transformer encoders model global temporal context via self-attention, and a final classification layer with softmax activation predicts the emotion class.
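
The pipeline described above could be sketched in PyTorch roughly as follows. This is not the authors' implementation. The class name `CNNTransformerSER`, the layer counts, channel widths, 64 Mel bands, four output classes, learned positional embedding, and mean pooling over time are all assumptions made for illustration, since the summary does not give these details.

```python
import torch
import torch.nn as nn

class CNNTransformerSER(nn.Module):
    """Illustrative CNN front end + Transformer encoder + linear classifier."""

    def __init__(self, n_mels=64, n_classes=4, d_model=128,
                 n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        # CNN: local spectral patterns; each pooling halves frequency and time.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Learned positional embedding (an assumption; the source does not specify).
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, mel):
        # mel: (batch, n_mels, frames)
        x = self.cnn(mel.unsqueeze(1))        # (B, 64, n_mels/4, frames/4)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (B, frames/4, 64 * n_mels/4)
        x = self.proj(x) + self.pos[:, :x.size(1)]
        x = self.encoder(x)                   # self-attention across time steps
        return self.head(x.mean(dim=1))       # logits; softmax applied outside

model = CNNTransformerSER().eval()
mel = torch.randn(2, 64, 200)  # stand-in for two log-Mel spectrograms
with torch.no_grad():
    probs = torch.softmax(model(mel), dim=-1)
print(probs.shape)  # torch.Size([2, 4])
```

Returning logits and applying softmax outside the module is a common convention because `nn.CrossEntropyLoss` expects raw logits during training. The softmax step in the description then corresponds to the inference-time call shown here.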

In practice

Use Mel-spectrograms as input for speech tasks.
Combine CNNs and Transformers for robust feature learning.
Apply data augmentation to improve model generalization.
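
The summary does not say which augmentation the authors used. One common option for spectrogram inputs is SpecAugment-style masking, sketched below under that assumption. The function name `mask_spectrogram`, the mask widths, and the choice to fill with the mean value are illustrative, not taken from the paper.

```python
import numpy as np

def mask_spectrogram(mel, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of a (n_mels, n_frames) spectrogram with one random
    frequency band and one random time span replaced by the mean value."""
    if rng is None:
        rng = np.random.default_rng()
    out = mel.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()

    # Frequency mask: hide up to max_freq_width consecutive Mel bands.
    f = int(rng.integers(0, max_freq_width + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill

    # Time mask: hide up to max_time_width consecutive frames.
    t = int(rng.integers(0, max_time_width + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill
    return out

rng = np.random.default_rng(0)
mel = rng.standard_normal((64, 200))  # stand-in for a log-Mel spectrogram
augmented = mask_spectrogram(mel, rng=rng)
print(augmented.shape)  # (64, 200)
```

Masking is typically applied on the fly during training only, so each epoch sees a slightly different version of every utterance while validation and test data stay untouched.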

Topics

Arabic Speech Emotion Recognition
Hybrid CNN-Transformer Architecture
Mel-spectrogram Features
Transformer Encoders
Self-Attention Mechanism

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.