IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference

2026-04-30 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Intermediate, quick

Summary

IBM has released two Granite Speech 4.1 2B models, each with roughly 2 billion parameters and licensed under Apache 2.0. The autoregressive (AR) model provides multilingual automatic speech recognition (ASR) and speech translation across six languages, including Japanese, and achieves a mean word error rate (WER) of 5.33 on the Open ASR Leaderboard. The non-autoregressive (NAR) variant, Granite Speech 4.1 2B-NAR, targets fast inference: it edits CTC hypotheses in a single pass and reaches an RTFx (inverse real-time factor) of approximately 1820 on a single H100 GPU. Both models pair a 16-layer Conformer encoder with dual-head CTC and a 2-layer Q-Former projector, and use a fine-tuned granite-4.0-1b-base as the language model backbone. A further variant, Granite Speech 4.1 2B-Plus, adds speaker-attributed ASR and word-level timestamps. The models were trained on 174,000 hours of audio and are natively supported in transformers>=4.52.1.
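To make the WER figure in the summary concrete, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; leaderboard scoring typically applies text normalization first and reports the result as a percentage, so 5.33 would correspond to roughly 5 errors per 100 reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Illustrative only: splits on whitespace and applies no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.2%}")
```

In the usage example, one substitution and one deletion against six reference words give a WER of 2/6.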

Key takeaway

For AI engineers building speech applications, the choice between the Granite Speech 4.1 2B models comes down to accuracy and features versus speed. Choose the autoregressive model when high-accuracy multilingual ASR or speech translation is the priority. Choose the non-autoregressive 2B-NAR variant when inference speed dominates, as in real-time transcription. Weigh your latency and language requirements before committing to either one.
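When weighing latency, it helps to translate RTFx into wall-clock time. RTFx is the ratio of audio duration to processing time, so higher is faster. The sketch below is plain arithmetic with hypothetical helper names; note that reported RTFx is a throughput figure measured under specific benchmark conditions, so it is an assumption that a given deployment would see the same number.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, rtfx_value: float) -> float:
    """Rough compute time for a given amount of audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    one_hour = 3600.0
    seconds = estimated_processing_seconds(one_hour, 1820.0)
    print(f"One hour of audio at RTFx 1820: about {seconds:.2f} s of compute")
```

At an RTFx of about 1820, one hour of audio corresponds to just under two seconds of processing under the benchmark's conditions.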

Key insights

IBM's Granite Speech 4.1 2B release splits the work between two models: an autoregressive one for high-accuracy multilingual ASR and translation, and a non-autoregressive one for very fast inference.

Principles

Autoregressive models prioritize accuracy and broader functionality.
Non-autoregressive models excel in inference speed.
Dual-head CTC improves encoder training for speech models.

Method

The models use a 16-layer Conformer encoder with dual-head CTC, a 2-layer Q-Former projector for audio downsampling, and a fine-tuned granite-4.0-1b-base LLM backbone.
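For readers unfamiliar with CTC, the hypotheses that a CTC head produces are usually obtained by taking the most likely token per audio frame, merging consecutive repeats, and dropping the blank symbol. The following is a minimal sketch of that standard greedy collapse, not IBM's implementation; the function name and the choice of `0` as the blank ID are assumptions.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse per-frame CTC predictions into a token sequence.

    Step 1: merge runs of identical consecutive IDs into one.
    Step 2: remove the blank ID (assumed to be 0 here).
    A blank between two identical tokens keeps them as separate tokens.
    """
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


if __name__ == "__main__":
    # Hypothetical per-frame argmax IDs from a CTC head.
    frames = [0, 3, 3, 0, 3, 5, 5, 0]
    print(ctc_greedy_collapse(frames))
```

A non-autoregressive editor, as described in the summary, would start from a hypothesis like this and refine it in one pass rather than generating tokens one at a time.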

In practice

Use AR for multilingual ASR and speech translation.
Opt for NAR for high-speed ASR inference.
Consider 2B-Plus for speaker attribution and timestamps.
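The guidance above can be written down as a small lookup. This is only a sketch of the decision rule as stated in this summary: the requirement keys and the function name are hypothetical, and the mapping says nothing about capabilities the summary does not mention.

```python
from typing import Iterable, List

# Requirement -> variant, taken from the three recommendations above.
# The keys are illustrative labels, not identifiers from any IBM API.
VARIANT_FOR_NEED = {
    "multilingual_asr": "AR",
    "speech_translation": "AR",
    "high_speed_asr": "2B-NAR",
    "speaker_attribution": "2B-Plus",
    "word_timestamps": "2B-Plus",
}


def variants_for(needs: Iterable[str]) -> List[str]:
    """Return the sorted set of variants suggested for the given needs.

    More than one result means no single variant is recommended here
    for every listed need, so the trade-off has to be evaluated manually.
    """
    return sorted({VARIANT_FOR_NEED[need] for need in needs})


if __name__ == "__main__":
    print(variants_for(["speech_translation"]))
    print(variants_for(["high_speed_asr", "word_timestamps"]))
```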

Topics

IBM Granite Speech
Autoregressive ASR
Non-Autoregressive Inference
Speech Translation
Conformer Encoder

Best for: Machine Learning Engineer, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.