How Together AI built the world’s fastest speech-to-text stack

2026-06-05 · Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

Together AI describes how it built what it calls the world's fastest speech-to-text stack, serving NVIDIA's Parakeet-TDT 0.6B v3 and OpenAI's Whisper Large v3 models. According to the article, published on 2026-05-29, the optimized system lets Parakeet-TDT 0.6B v3 transcribe approximately 20 hours of speech in under 10 seconds. The result comes from tuning the whole serving path rather than the model alone: GPU execution, CPU preprocessing, memory movement, and runtime behavior. On the GPU, the encoder is compiled into NVIDIA TensorRT multi-profile engines that cover the range of audio shapes seen in practice, and conditional NVIDIA CUDA graphs remove CPU involvement from the decoder loop for a 2-3x speedup. On the CPU, the team collapsed process boundaries and moved data over custom Unix domain sockets and shared memory for zero-copy transfer. Evented I/O with epoll handles streaming connections, and gc.freeze() eliminates tail latency spikes of about 200 ms caused by the Python garbage collector.
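
To put the headline figure in perspective, the arithmetic below converts it into a real-time factor. It uses only the two numbers quoted in the summary; because the processing time is given as an upper bound, the result should be read as a lower bound on the speedup.

```python
# Back-of-the-envelope real-time factor from the figures quoted in the summary.
AUDIO_HOURS = 20          # approximate audio duration
WALL_CLOCK_SECONDS = 10   # upper bound on processing time

audio_seconds = AUDIO_HOURS * 3600
speedup_lower_bound = audio_seconds / WALL_CLOCK_SECONDS

print(f"{audio_seconds} s of audio in under {WALL_CLOCK_SECONDS} s "
      f"is more than {speedup_lower_bound:.0f}x real time")
```

In other words, the quoted result corresponds to transcription running at more than 7,200 times real time.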

Key takeaway

For MLOps engineers scaling voice AI in production, lowering ASR latency means looking past model performance to the whole system stack. On the GPU, keep control flow on the device with conditional CUDA graphs and compile encoders with TensorRT multi-profile engines. On the CPU, remove copies with shared memory, handle streaming connections with epoll, and call gc.freeze() so that Python garbage collection does not cause tail latency spikes. Together these make the user experience both fast and predictable.
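
As a rough illustration of the shared-memory idea, the sketch below hands a buffer of audio samples from a producer to a consumer through Python's standard multiprocessing.shared_memory module. It is not Together AI's implementation: the sample values are made up, and both sides run in one process here only to keep the example self-contained.

```python
import array
from multiprocessing import shared_memory

# Producer: write 16-bit PCM samples into a named shared-memory block.
samples = array.array("h", [0, 1200, -1200, 32767, -32768])  # hypothetical audio frame
nbytes = len(samples) * samples.itemsize
shm = shared_memory.SharedMemory(create=True, size=nbytes)
shm.buf[:nbytes] = samples.tobytes()

# Consumer: attach by name and view the same bytes without copying them.
peer = shared_memory.SharedMemory(name=shm.name)
view = peer.buf[:nbytes].cast("h")   # memoryview over the shared pages
received = view.tolist()             # copied out only so it can be printed
print(received)

# Release the view before closing, then unlink from the owning side.
view.release()
peer.close()
shm.close()
shm.unlink()
```

In a real service the consumer would be a separate process that receives only the block name (for example over a Unix domain socket) and reads the audio in place.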

Key insights

Optimizing ASR requires end-to-end system tuning, from GPU kernels to CPU data paths and runtime, to achieve minimal latency.

Principles

ASR optimization demands end-to-end system focus.
Tailor model compilation to real input shape distributions.
Eliminate CPU-GPU synchronization in hot loops.
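
The second principle can be sketched without any GPU code. The example below derives length buckets from a sample of observed input lengths and routes each request to the smallest bucket that fits, which is the kind of shape range a TensorRT optimization profile describes. It does not call TensorRT; the lengths, percentiles, and function names are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical utterance lengths (in feature frames) sampled from production traffic.
observed_frames = [80, 120, 150, 300, 320, 900, 1000, 1100, 2800, 3000]

def bucket_bounds(lengths, percentiles=(50, 90, 100)):
    """Return the upper bound of each bucket, taken at the given percentiles."""
    ordered = sorted(lengths)
    n = len(ordered)
    # ceil(p * n / 100) - 1 is the nearest-rank index of the p-th percentile.
    return [ordered[(p * n + 99) // 100 - 1] for p in percentiles]

def pick_profile(num_frames, bounds):
    """Return the index of the smallest bucket whose upper bound fits the input."""
    index = bisect_left(bounds, num_frames)
    if index == len(bounds):
        raise ValueError(f"{num_frames} frames exceeds the largest profile ({bounds[-1]})")
    return index

bounds = bucket_bounds(observed_frames)
for frames in (100, 321, 3000):
    print(frames, "-> profile", pick_profile(frames, bounds))
```

Deriving the bounds from measured traffic rather than round numbers keeps each compiled range narrow where most requests fall.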

Method

Optimize ASR by compiling encoders for the audio shapes actually seen in traffic, moving the decoder loop's control flow onto the GPU with conditional CUDA graphs, reducing CPU-side copies, using evented I/O for streaming connections, and freezing Python GC state after startup.
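
The evented I/O step can be illustrated with Python's standard selectors module, which uses epoll on Linux. This is a minimal sketch, not the article's server: the two "streams" are simulated with local socket pairs, and the stream names and payloads are invented.

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll-backed on Linux
received = {}

# Two hypothetical streaming clients, simulated with connected socket pairs.
pairs = {name: socket.socketpair() for name in ("stream-a", "stream-b")}
for name, (client, server) in pairs.items():
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, data=name)

pairs["stream-a"][0].sendall(b"chunk-1")
pairs["stream-b"][0].sendall(b"chunk-2")

# One thread services whichever connections are ready; idle ones are not polled.
while len(received) < len(pairs):
    for key, _events in sel.select(timeout=1.0):
        received[key.data] = key.fileobj.recv(4096)

for client, server in pairs.values():
    sel.unregister(server)
    client.close()
    server.close()
sel.close()
print(received)
```

The design point is that a single loop waits on all connections at once, so the cost of a mostly idle stream is close to zero.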

In practice

Use TensorRT multi-profile engines for varied inputs.
Implement conditional CUDA graphs for GPU-bound loops.
Apply gc.freeze() to prevent Python GC tail latency.
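
The last item maps onto a small amount of standard-library code. The sketch below shows the usual pattern of collecting once and then freezing after startup; the warm-up data is a hypothetical stand-in for long-lived server state, and where exactly to call gc.freeze() in a real service is an assumption rather than a detail from the article.

```python
import gc

# Hypothetical stand-in for long-lived startup state (vocabulary tables, configs, ...).
warm_state = [[i, str(i)] for i in range(10_000)]

gc.collect()   # clear existing garbage so it is not frozen along with live objects
gc.freeze()    # move all currently tracked objects to the permanent generation
frozen_count = gc.get_freeze_count()
print(f"objects ignored by future collections: {frozen_count}")

# ... serve requests here; collections now skip everything frozen above ...

gc.unfreeze()  # only for this demo, to restore normal collector behavior
```

Because frozen objects are skipped by later collections, a full collection during request handling no longer has to walk the large object graph built at startup.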

Topics

Speech-to-Text
ASR Optimization
TensorRT
CUDA Graphs
Python GC
Evented I/O
MLOps

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.