IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference
Summary
IBM has released two new Granite Speech 4.1 2B models, each with roughly 2 billion parameters and licensed under Apache 2.0. The autoregressive (AR) model handles multilingual automatic speech recognition (ASR) and speech translation across six languages, including Japanese, and achieves a mean word error rate (WER) of 5.33 on the Open ASR Leaderboard. The non-autoregressive (NAR) variant, Granite Speech 4.1 2B-NAR, targets fast inference: it edits CTC hypotheses in a single pass and reaches an RTFx (seconds of audio processed per second of compute) of approximately 1820 on a single H100 GPU. Both models combine a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer Q-Former projector, and a language model backbone fine-tuned from granite-4.0-1b-base. A third variant, Granite Speech 4.1 2B-Plus, adds speaker-attributed ASR and word-level timestamps. The models were trained on 174,000 hours of audio and are natively supported in transformers>=4.52.1.
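To make the WER figure concrete, here is a minimal sketch of how word error rate is usually computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and sample sentences are illustrative, and leaderboards typically apply text normalization before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {100 * wer:.2f}%")
```

Assuming the leaderboard reports WER as a percentage, a mean of 5.33 corresponds to roughly one word error per 19 reference words.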
Key takeaway
For AI engineers building speech applications, the choice between the Granite Speech 4.1 2B models comes down to accuracy and features versus speed. If the priority is high-accuracy multilingual ASR or speech translation, choose the autoregressive model. If the priority is very fast inference, as in real-time transcription, choose the non-autoregressive 2B-NAR variant. Evaluate your latency and language requirements before selecting a model for deployment.
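When weighing latency requirements, the reported RTFx can be turned into a rough processing-time estimate. The sketch below assumes RTFx means audio duration divided by processing time; the helper name is hypothetical. RTFx is a throughput figure that is usually measured under batched conditions, so treat the result as a capacity estimate, not a per-request latency guarantee.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate compute time for a given amount of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One hour of audio at the reported RTFx of about 1820 on a single H100.
one_hour = 60 * 60
print(f"{processing_seconds(one_hour, 1820):.2f} s")  # about 1.98 s
```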
Key insights
The Granite Speech 4.1 2B release splits the accuracy/speed trade-off across variants: the AR model targets high-accuracy multilingual ASR and translation, while the NAR model targets extremely fast inference.
Principles
- Autoregressive models prioritize accuracy and broader functionality.
- Non-autoregressive models excel in inference speed.
- Dual-head CTC improves encoder training for speech models.
Method
The models use a 16-layer Conformer encoder with dual-head CTC, a 2-layer Q-Former projector for audio downsampling, and a fine-tuned granite-4.0-1b-base LLM backbone.
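As background for the CTC hypotheses that the NAR variant edits, the following is a generic sketch of greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. It is not IBM's implementation; the function name, the blank index of 0, and the sample token ids are assumptions for illustration.

```python
from typing import List, Sequence


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse per-frame argmax token ids into a CTC hypothesis."""
    tokens: List[int] = []
    previous = None
    for token in frame_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token != previous and token != blank_id:
            tokens.append(token)
        previous = token
    return tokens


# The blank between the two runs of 1 keeps them as separate tokens.
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```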
In practice
- Use AR for multilingual ASR and speech translation.
- Opt for NAR for high-speed ASR inference.
- Consider 2B-Plus for speaker attribution and timestamps.
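The guidance above can be written down as a small decision helper. This is a sketch only: the function, its flags, and the priority order are assumptions, and the returned strings are short labels for the variants, not model identifiers. It also assumes that only the AR model offers translation, since the summary does not say otherwise.

```python
def pick_variant(
    needs_translation: bool = False,
    needs_speakers_or_timestamps: bool = False,
    speed_critical: bool = False,
) -> str:
    """Map deployment requirements to a Granite Speech 4.1 2B variant label."""
    if needs_translation:
        # Speech translation is described only for the AR model.
        return "AR"
    if needs_speakers_or_timestamps:
        # Speaker attribution and word-level timestamps come with 2B-Plus.
        return "2B-Plus"
    if speed_critical:
        # Single-pass editing of CTC hypotheses favors throughput.
        return "2B-NAR"
    # Default to the AR model for general multilingual ASR.
    return "AR"


print(pick_variant(speed_critical=True))  # 2B-NAR
```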
Topics
- IBM Granite Speech
- Autoregressive ASR
- Non-Autoregressive Inference
- Speech Translation
- Conformer Encoder
Best for: Machine Learning Engineer, NLP Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.