A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition
Summary
This systematic empirical study evaluates domain-specific fine-tuning of pretrained Transformer-based models for Quranic Automatic Speech Recognition (ASR). It compares three self-supervised speech feature extractors, Wav2Vec2.0, HuBERT, and XLS-R, each of which uses a Transformer architecture to learn context-aware speech representations. The models were fine-tuned on a filtered Quranic dataset of more than 870 hours of professional and user recitations. Ablation studies across feature extractors, output label formats, training strategies, and clip durations identified the factors that most influence transcription accuracy. The best configuration reached a Word Error Rate (WER) of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting. The combined result is roughly five percentage points better than the Citrinet baseline (WER = 0.163), and combined-model training time fell from 140 hours to 40 hours. Two findings stand out: Arabic labels without diacritics gave the best fine-tuning results, and Wav2Vec2-XLSR-53 provided the strongest overall representation.
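For readers less familiar with the metric, WER is the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the study; the function name `word_error_rate` is an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of four reference words.
print(word_error_rate("بسم الله الرحمن الرحيم", "بسم الله الرحيم"))  # 0.25

# Absolute gap between the Citrinet baseline and the combined setting.
print(round(0.163 - 0.11, 3))  # 0.053
```

On this scale, a WER of 0.08 corresponds to about 8 word errors per 100 reference words.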
Key takeaway
Machine learning engineers building ASR systems for specialized linguistic domains such as Quranic recitation should prioritize fine-tuning pretrained Transformer models. In this study, Wav2Vec2-XLSR-53 with diacritic-free Arabic labels reached a WER of 0.08 on the EveryAyah subset, and combined-model training time fell from 140 to 40 hours. The approach improves both accuracy and efficiency, which supports more robust tools for aided memorization or search.
Key insights
Fine-tuning pretrained Transformers with a well-chosen speech representation (Wav2Vec2-XLSR-53) and label format (diacritic-free Arabic) improves both the accuracy and the training efficiency of Quranic ASR.
Principles
- Fine-tuned pretrained Transformers perform well on domain-specific ASR.
- Diacritic-free Arabic labels gave the best fine-tuning results.
- Wav2Vec2-XLSR-53 gave the strongest overall speech representation among the models compared.
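The diacritic-free label format can be illustrated with a small normalization helper. This is a sketch under stated assumptions: the summary does not say which marks the authors removed, so the set below (harakat and tanween U+064B to U+0652, plus superscript alef U+0670) is an assumed minimum, and `strip_diacritics` is a hypothetical name.

```python
import re

# Assumed mark set: fathatan through sukun (U+064B-U+0652) and the
# superscript alef (U+0670). Quranic text may carry further annotation
# signs that a real pipeline would also need to decide about.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the assumed Arabic diacritics removed."""
    return _DIACRITICS.sub("", text)


print(strip_diacritics("بِسْمِ اللَّهِ"))
```

Base letters and spaces pass through unchanged, so word boundaries in the labels are preserved.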
Method
Fine-tune Wav2Vec2.0, HuBERT, or XLS-R on more than 870 hours of filtered Quranic audio, then compare label formats (diacritic-free Arabic performed best) and training strategies.
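One concrete piece of the label-format choice is the output vocabulary the fine-tuned model predicts over. The sketch below assumes character-level labels with a CTC-style vocabulary, a common setup for Wav2Vec2-family fine-tuning that this summary does not confirm for the study. The special token names (`<pad>`, `<unk>`, `|` as word delimiter) and the helper `build_char_vocab` are illustrative assumptions.

```python
def build_char_vocab(transcripts):
    """Map each character seen in the transcripts to an integer id.

    Assumed conventions: id 0 is padding, id 1 is the unknown token,
    and spaces are represented by a "|" word delimiter.
    """
    chars = set()
    for text in transcripts:
        chars.update(text)
    chars.discard(" ")  # spaces are covered by the "|" delimiter

    vocab = {"<pad>": 0, "<unk>": 1, "|": 2}
    for ch in sorted(chars):
        # setdefault keeps the reserved ids if a transcript contains "|".
        vocab.setdefault(ch, len(vocab))
    return vocab


vocab = build_char_vocab(["بسم الله", "الرحمن الرحيم"])
print(len(vocab))
```

Removing diacritics before building such a vocabulary shrinks the label set, since each vowel mark would otherwise be a separate output symbol.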
In practice
- Use Wav2Vec2-XLSR-53 for Quranic ASR.
- Prioritize diacritic-free Arabic text for labels.
- Filter large datasets for domain-specific fine-tuning.
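The filtering step in the list above might look like the following duration filter. It is only a sketch: the summary does not state the study's filtering criteria, so the `min_s` and `max_s` thresholds, the `Clip` record, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    path: str          # location of the audio file
    duration_s: float  # clip length in seconds


def filter_clips(clips, min_s=1.0, max_s=30.0):
    """Keep clips whose duration lies within [min_s, max_s] seconds.

    The default thresholds are placeholders, not values from the study.
    """
    return [c for c in clips if min_s <= c.duration_s <= max_s]


def total_hours(clips):
    """Total audio duration of the given clips, in hours."""
    return sum(c.duration_s for c in clips) / 3600.0


clips = [Clip("a.wav", 0.5), Clip("b.wav", 10.0), Clip("c.wav", 45.0)]
kept = filter_clips(clips)
print([c.path for c in kept])  # ['b.wav']
```

Tracking total hours before and after filtering makes it easy to report dataset size alongside each clip-duration setting.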
Topics
- Quranic ASR
- Transformer Models
- Speech Representations
- Fine-tuning
- Wav2Vec2-XLSR-53
- Word Error Rate
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.