How Together AI built the world’s fastest speech-to-text stack
Summary
Together AI reports building the world's fastest speech-to-text stack, built around NVIDIA's Parakeet-TDT 0.6B v3 and OpenAI's Whisper Large v3 models. According to the article, published on 5/29/2026, the optimized system lets Parakeet-TDT 0.6B v3 transcribe approximately 20 hours of speech in under 10 seconds. The performance comes from treating latency as a whole-system problem that spans GPU execution, CPU preprocessing, memory movement, and runtime behavior. Key optimizations include compiling the encoder into NVIDIA TensorRT multi-profile engines that cover diverse audio shapes; using conditional NVIDIA CUDA graphs to remove the CPU from the decoder loop, for a 2-3x speedup; and streamlining the CPU path with collapsed process boundaries, custom Unix domain sockets, and shared memory for zero-copy data transfer. In addition, evented I/O with epoll manages streaming connections efficiently, and gc.freeze() eliminates tail latency spikes of about 200 ms caused by the Python garbage collector.
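The shared-memory handoff mentioned in the summary can be illustrated with Python's standard library. This is a minimal sketch, not Together AI's implementation: the 16-bit PCM sample layout and the helper names `write_audio` and `read_audio_view` are assumptions made for illustration.

```python
import array
from multiprocessing import shared_memory

SAMPLE_BYTES = 2  # assumption: 16-bit signed PCM samples


def write_audio(samples: array.array) -> shared_memory.SharedMemory:
    """Producer side: copy the samples into a new shared-memory block once."""
    nbytes = len(samples) * samples.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = samples.tobytes()
    return shm


def read_audio_view(name: str, n_samples: int):
    """Consumer side: attach by name and view the samples without copying."""
    shm = shared_memory.SharedMemory(name=name)
    # The cast produces an int16 view over the same memory; no bytes are copied.
    view = shm.buf[: n_samples * SAMPLE_BYTES].cast("h")
    return shm, view
```

The consumer only needs the block's name and the sample count, which are small enough to send over a socket. The view has to be released before the block is closed, and the creating side is responsible for calling `unlink()`.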
Key takeaway
For MLOps engineers scaling voice AI in production, automatic speech recognition (ASR) latency depends on the entire system stack, not just the model. Move decoder control flow onto the GPU with conditional CUDA graphs, and compile encoders as TensorRT multi-profile engines. On the CPU path, use shared memory to avoid copies and epoll to handle streaming connections, and call gc.freeze() to remove Python GC-induced tail latency, so that users see predictable, low latency.
Key insights
Optimizing ASR requires end-to-end system tuning, from GPU kernels to CPU data paths and runtime, to achieve minimal latency.
Principles
- ASR optimization demands end-to-end system focus.
- Tailor model compilation to real input shape distributions.
- Eliminate CPU-GPU synchronization in hot loops.
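To make the second principle concrete, here is a rough sketch of matching an input to a shape range. It does not call TensorRT; the `Profile` class, the frame ranges, and `pick_profile` are hypothetical stand-ins for choosing among profiles that were each tuned for one slice of the observed audio-length distribution.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Profile:
    """One shape range: tuned for opt_frames, valid from min_frames to max_frames."""
    index: int
    min_frames: int
    opt_frames: int
    max_frames: int


# Hypothetical ranges; in practice they would come from production traffic.
PROFILES = (
    Profile(0, 1, 200, 400),         # short utterances
    Profile(1, 401, 1500, 3000),     # typical requests
    Profile(2, 3001, 6000, 12000),   # long-form audio
)


def pick_profile(n_frames: int,
                 profiles: Sequence[Profile] = PROFILES) -> Optional[Profile]:
    """Return the first profile whose range contains n_frames, else None."""
    for profile in profiles:
        if profile.min_frames <= n_frames <= profile.max_frames:
            return profile
    return None
```

An input that falls outside every range returns `None`, which a caller could treat as a signal to chunk the audio or fall back to a slower path.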
Method
Optimize ASR by compiling encoders for real audio shapes, moving decoder logic to the GPU with conditional CUDA graphs, reducing CPU copies, using evented I/O, and freezing Python GC state.
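The evented I/O step can be sketched with Python's `selectors` module, whose `DefaultSelector` uses epoll on Linux. This is an illustrative sketch only: the helper names `register_stream` and `drain_ready`, the 4096-byte read size, and the string labels are assumptions, not details from the original article.

```python
import selectors
import socket


def register_stream(sel: selectors.BaseSelector, conn: socket.socket, label: str) -> None:
    """Watch a connection for readability without blocking on it."""
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, data=label)


def drain_ready(sel: selectors.BaseSelector, timeout: float = 1.0) -> dict:
    """Read one chunk from each connection that is ready right now."""
    received = {}
    for key, _events in sel.select(timeout):
        chunk = key.fileobj.recv(4096)
        if chunk:
            received[key.data] = chunk
        else:
            # An empty read means the peer closed the connection.
            sel.unregister(key.fileobj)
            key.fileobj.close()
    return received
```

A single thread can call `drain_ready` in a loop and touch only the connections that have data, rather than dedicating a blocked thread to each idle stream.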
In practice
- Use TensorRT multi-profile engines for varied inputs.
- Implement conditional CUDA graphs for GPU-bound loops.
- Apply gc.freeze() to prevent Python GC tail latency.
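The last item can be sketched in a few lines of standard-library Python. The function name `freeze_after_warmup` and the idea of calling it once after model loading and warm-up are assumptions for illustration; the source only states that gc.freeze() is used.

```python
import gc


def freeze_after_warmup() -> int:
    """Collect startup garbage, then exclude surviving objects from future collections.

    Returns the number of objects moved to the permanent generation.
    """
    gc.collect()   # clear garbage left over from startup first
    gc.freeze()    # objects tracked at this point are ignored by later collections
    return gc.get_freeze_count()
```

Calling this once after startup means later collections no longer traverse the large set of long-lived objects created during initialization. `gc.unfreeze()` reverses the effect if needed.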
Topics
- Speech-to-Text
- ASR Optimization
- TensorRT
- CUDA Graphs
- Python GC
- Evented I/O
- MLOps
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.