Voxtral Realtime

2026-02-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Voxtral Realtime is a new natively streaming automatic speech recognition (ASR) model that achieves offline transcription quality with sub-second latency. Unlike methods that adapt offline models, Voxtral Realtime is trained end-to-end for streaming, featuring explicit audio-text alignment. Its architecture builds on Delayed Streams Modeling, incorporating a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. The model, with 4 billion parameters, was pretrained on a large dataset across 13 languages. At a 480ms delay, Voxtral Realtime performs comparably to Whisper, a widely used offline system, and ElevenLabs Scribe v2 Realtime. It surpasses these baselines at 960ms delay and approaches Voxtral Mini Transcribe V2's performance at 2400ms. The model weights are released under the Apache 2.0 license, and it integrates with the vLLM framework for efficient serving.

Key takeaway

For AI architects and NLP engineers building real-time speech applications, Voxtral Realtime is an open-weights model (Apache 2.0) that narrows the usual trade-off between transcription quality and latency. It is worth evaluating for live captioning, voice assistants, and interactive speech interfaces, where its 13-language coverage and sub-second delay settings apply directly.

Key insights

Voxtral Realtime delivers offline-quality ASR with sub-second latency via a natively streaming, end-to-end trained architecture.

Principles

Native streaming architectures require explicit audio-text alignment.
Adaptive RMS-Norm (Ada RMS-Norm) improves delay conditioning and convergence.
Word grouping in training preserves language model capabilities.
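
To make the Ada RMS-Norm principle concrete, here is a minimal sketch of an RMS normalization whose per-channel gain is modulated by an embedding of the target delay. The summary does not give the exact formulation, so the modulation form (gain times one plus a linear projection of the conditioning vector), the sinusoidal delay_embedding helper, and all parameter names are assumptions for illustration only.

```python
import math


def rms_norm(x, eps=1e-6):
    """Rescale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def delay_embedding(delay_ms, dim=4, max_ms=2400.0):
    """Hypothetical sinusoidal embedding of the target delay (not from the source)."""
    t = delay_ms / max_ms
    emb = []
    for k in range(dim):
        angle = (k // 2 + 1) * math.pi * t
        emb.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return emb


def ada_rms_norm(x, base_gain, cond, w_cond, eps=1e-6):
    """RMS-norm with a gain that depends on a conditioning vector.

    Assumed form: gain_i = base_gain_i * (1 + sum_j w_cond[i][j] * cond[j]).
    With w_cond all zeros this reduces to a plain RMS-norm with base_gain.
    """
    normed = rms_norm(x, eps)
    out = []
    for i, v in enumerate(normed):
        delta = sum(w * c for w, c in zip(w_cond[i], cond))
        out.append(v * base_gain[i] * (1.0 + delta))
    return out


if __name__ == "__main__":
    x = [3.0, 4.0, 0.0, -5.0]
    gain = [1.0] * 4
    w = [[0.1] * 4 for _ in range(4)]
    for delay in (480, 2400):
        print(delay, ada_rms_norm(x, gain, delay_embedding(delay), w))
```

The point of the sketch is only that the same hidden state is normalized differently depending on the requested delay, which is one plausible way a single set of weights could serve several latency settings.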

Method

Voxtral Realtime uses a causal audio encoder, an MLP adapter for downsampling, and a Transformer decoder with Ada RMS-Norm, trained with delay sampling and z-loss penalty.
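
The z-loss penalty mentioned above is commonly defined as a small coefficient times the squared log-partition function of the output logits, added to the cross-entropy loss to keep logits from drifting. The sketch below shows that common form for a single token; the coefficient value 1e-4 and the function names are assumptions, since the summary does not state the settings used for Voxtral Realtime.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (total, cross_entropy, z_loss) for one token.

    cross_entropy = log Z - logits[target]
    z_loss        = z_coeff * (log Z) ** 2   # z_coeff is an assumed value
    """
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return ce + z_loss, ce, z_loss


if __name__ == "__main__":
    # Adding a constant to every logit leaves cross-entropy unchanged
    # but increases the z-loss, which is what the penalty discourages.
    print(cross_entropy_with_z_loss([0.0, 0.0], target=0))
    print(cross_entropy_with_z_loss([10.0, 10.0], target=0))
```

Because softmax is invariant to a constant shift of the logits, cross-entropy alone cannot stop such drift; the z-loss term gives the optimizer a reason to keep log Z near zero.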

In practice

Integrates with vLLM for efficient, low-latency serving.
Supports 13 languages for broad application.
Operates at tunable delays from 80ms to 2400ms.
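
As a rough illustration of what a tunable delay means in a delayed-streams setup, the sketch below converts a delay in milliseconds to a number of frames and builds a text stream in which each word is emitted that many frames after it ends in the audio. The 80 ms frame step is an assumption inferred from the smallest supported delay, and the padding token, the use of word end times, and the function names are hypothetical.

```python
FRAME_MS = 80   # assumed frame step; the source only gives 80 ms as the smallest delay
PAD = "<pad>"   # hypothetical filler token for frames with no text


def delay_to_frames(delay_ms, frame_ms=FRAME_MS):
    """Convert a delay in milliseconds to a whole number of frames."""
    if delay_ms % frame_ms != 0:
        raise ValueError("delay must be a multiple of the frame step")
    return delay_ms // frame_ms


def build_delayed_text_stream(words, num_frames, delay_ms, frame_ms=FRAME_MS):
    """Place each word at its end frame plus the delay, padding elsewhere.

    words: list of (text, end_time_ms) pairs in spoken order.
    If two words map to the same frame, the later one moves to the next free frame.
    Words that would land past num_frames are dropped.
    """
    shift = delay_to_frames(delay_ms, frame_ms)
    stream = [PAD] * num_frames
    next_free = 0
    for text, end_ms in words:
        frame = max(end_ms // frame_ms + shift, next_free)
        if frame >= num_frames:
            break
        stream[frame] = text
        next_free = frame + 1
    return stream


if __name__ == "__main__":
    words = [("hello", 160), ("world", 400)]
    print(build_delayed_text_stream(words, num_frames=12, delay_ms=480))
```

Under the 80 ms assumption, the delays quoted in the summary correspond to 6 frames (480 ms), 12 frames (960 ms), and 30 frames (2400 ms) of look-ahead before a word has to be emitted.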

Topics

Streaming ASR
Voxtral Realtime
Delayed Streams Modeling
Low-Latency Inference
Multilingual Speech Recognition

Best for: NLP Engineer, CTO, AI Architect, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.