A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

2026-06-18 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A systematic empirical study evaluated domain-specific fine-tuning of pretrained Transformer-based models for Quranic Automatic Speech Recognition (ASR). It compared three self-supervised speech encoders, Wav2Vec2.0, HuBERT, and XLS-R, which use Transformer architectures to learn context-aware speech features. The models were fine-tuned on a filtered Quranic dataset of more than 870 hours of professional and user recitations. Ablation studies across feature extractors, output label formats, training strategies, and clip durations identified the factors that most affect transcription accuracy. The best configuration reached a Word Error Rate (WER) of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting. The combined result is about five percentage points below the Citrinet baseline (WER = 0.163), and combined-model training time fell from 140 hours to 40 hours. Arabic text without diacritics gave the best fine-tuning results, and Wav2Vec2-XLSR-53 provided the strongest overall representation.
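The reported figures can be checked with a little arithmetic. The sketch below is a minimal, assumed implementation of word-level WER (edit distance over whitespace-separated words, divided by the reference length), not the study's evaluation code; the name `word_error_rate` is illustrative. It also works out the gap between the combined-setting WER of 0.11 and the Citrinet baseline of 0.163.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Figures quoted in the summary.
baseline_wer = 0.163  # Citrinet baseline
combined_wer = 0.11   # EveryAyah+Tarteel setting

absolute_gain = baseline_wer - combined_wer    # about 0.053
relative_gain = absolute_gain / baseline_wer   # about 0.325
time_saved = (140 - 40) / 140                  # about 0.714
```

On these numbers the combined setting sits about 5.3 points below the baseline in absolute terms, roughly a one-third relative reduction, and training time drops by about 71%.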

Key takeaway

Machine learning engineers building ASR systems for specialized linguistic domains such as Quranic recitation should prioritize fine-tuning pretrained Transformer models. In this study, Wav2Vec2-XLSR-53 with diacritic-free Arabic labels reached a WER of 0.08 on the EveryAyah subset, and combined-model training time fell from 140 to 40 hours. The approach offers a tested path to better accuracy and efficiency in demanding ASR applications, supporting more robust tools for assisted memorization or search.

Key insights

Fine-tuning pretrained Transformers with specific speech representations and label formats significantly improves Quranic ASR accuracy and efficiency.

Principles

Pretrained Transformers excel in domain-specific ASR.
Diacritic-free Arabic text improves fine-tuning.
Wav2Vec2-XLSR-53 offers robust speech representation.
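Producing diacritic-free labels comes down to a text normalization pass over the transcripts. The summary does not give the study's exact normalization, so the character set below is an assumption, and `strip_diacritics` is an illustrative name: a minimal sketch that removes the common Arabic vowel marks and Quranic annotation signs while leaving the base letters intact.

```python
import re

# Assumed set of characters to drop (not taken from the study):
#   U+064B-U+0652  tanwin, short vowels, shadda, sukun
#   U+0670         superscript alef
#   U+06D6-U+06ED  Quranic annotation signs
#   U+0640         tatweel (elongation stroke)
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u06D6-\u06ED\u0640]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics and collapse runs of whitespace."""
    without_marks = _DIACRITICS.sub("", text)
    return " ".join(without_marks.split())


label = strip_diacritics("بِسْمِ")  # fully vocalized input, bare letters out
```

Whether to drop the superscript alef or keep hamza variants distinct is a design choice that changes the label set, so it is worth fixing once and applying identically to training and evaluation text.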

Method

Fine-tune Wav2Vec2.0, HuBERT, or XLS-R on >870 hours of domain-specific audio, optimizing label formats (diacritic-free Arabic) and training strategies.
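Fine-tuning these encoders with a CTC head typically starts from a character vocabulary built from the training labels. The summary does not describe the study's tokenizer, so the following is only a sketch of that preparatory step, assuming the Hugging Face convention of `|` as the word delimiter plus `[UNK]` and `[PAD]` tokens; `build_ctc_vocab` and `encode` are hypothetical helpers.

```python
def build_ctc_vocab(transcripts):
    """Map every label character to an integer id for a CTC output layer."""
    # Spaces become the "|" delimiter, so both are excluded from the raw set.
    chars = sorted({ch for text in transcripts for ch in text if ch not in " |"})
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    vocab["|"] = len(vocab)      # word delimiter
    vocab["[UNK]"] = len(vocab)  # fallback for unseen characters
    vocab["[PAD]"] = len(vocab)  # padding; doubles as the CTC blank in this convention
    return vocab


def encode(text, vocab):
    """Convert a transcript to a list of ids, mapping spaces to the delimiter."""
    unk = vocab["[UNK]"]
    return [vocab.get("|" if ch == " " else ch, unk) for ch in text]


vocab = build_ctc_vocab(["ab a", "bc"])
ids = encode("a b", vocab)
```

A dictionary like this can be written to JSON and loaded by a tokenizer such as `Wav2Vec2CTCTokenizer`; the size of the vocabulary then sets the output dimension of the CTC head.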

In practice

Use Wav2Vec2-XLSR-53 for Quranic ASR.
Prioritize diacritic-free Arabic text for labels.
Filter large datasets for domain-specific fine-tuning.
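Filtering can be as simple as dropping clips that fall outside a duration window or carry an empty label. The sketch below assumes a hypothetical `Clip` record and hypothetical bounds of 1 to 30 seconds; the summary mentions clip-duration ablations but does not state the thresholds the study used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    path: str          # location of the audio file
    duration_s: float  # clip length in seconds
    transcript: str    # label text


def filter_clips(clips, min_s=1.0, max_s=30.0):
    """Keep clips with a non-empty transcript and a duration in [min_s, max_s]."""
    return [
        c for c in clips
        if c.transcript.strip() and min_s <= c.duration_s <= max_s
    ]


def total_hours(clips):
    """Sum clip durations and convert seconds to hours."""
    return sum(c.duration_s for c in clips) / 3600.0


clips = [
    Clip("a.wav", 0.5, "x"),     # too short
    Clip("b.wav", 10.0, "y"),    # kept
    Clip("c.wav", 20.0, "  "),   # blank transcript
    Clip("d.wav", 3590.0, "z"),  # too long for the default window
]
kept = filter_clips(clips)
```

Reporting `total_hours` before and after filtering makes it easy to see how much audio each rule removes.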

Topics

Quranic ASR
Transformer Models
Speech Representations
Fine-tuning
Wav2Vec2-XLSR-53
Word Error Rate

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.