How Together AI built the world’s fastest speech-to-text stack

Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

Together AI has built what it describes as the world's fastest speech-to-text stack, reporting low-latency results for NVIDIA's Parakeet-TDT 0.6B v3 and OpenAI's Whisper Large v3 models. According to the article, published on 5/29/2026, the optimized system lets Parakeet-TDT 0.6B v3 transcribe approximately 20 hours of speech in under 10 seconds. The gains come from treating latency as a whole-system problem that spans GPU execution, CPU preprocessing, memory movement, and runtime behavior. On the GPU, the encoder is compiled into NVIDIA TensorRT multi-profile engines so that it handles diverse audio shapes, and conditional NVIDIA CUDA graphs remove the CPU from the decoder loop for a 2-3x speedup. On the CPU, process boundaries are collapsed, and custom Unix domain sockets combined with shared memory provide zero-copy data transfer. Evented I/O with epoll manages streaming connections efficiently, and gc.freeze() eliminates tail latency spikes of about 200 ms caused by the Python garbage collector.
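To put the headline figure in perspective, the arithmetic can be sketched as a real-time factor: seconds of audio processed per second of wall-clock time. The sketch below uses only the two numbers quoted in the summary. Because the time is given as an upper bound, the result is a lower bound, and it says nothing about per-request latency or the hardware and batching used, which the summary does not state.

```python
# Figures quoted in the summary: ~20 hours of speech, under 10 seconds of wall time.
AUDIO_HOURS = 20
WALL_SECONDS = 10  # upper bound on processing time


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds


audio_seconds = AUDIO_HOURS * 3600
rtfx = real_time_factor(audio_seconds, WALL_SECONDS)
print(f"{audio_seconds} s of audio in under {WALL_SECONDS} s: at least {rtfx:.0f}x real time")
```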

Key takeaway

For MLOps engineers scaling voice AI in production, reducing ASR latency means looking beyond model performance to the entire system stack. Move decoder control flow onto the GPU with conditional CUDA graphs, and compile encoders with TensorRT multi-profile engines. Just as important, remove CPU-path inefficiencies with shared memory and epoll-based streaming I/O, and call gc.freeze() to eliminate tail latency caused by the Python garbage collector, so that users see predictable, low latency.

Key insights

Optimizing ASR requires end-to-end system tuning, from GPU kernels to CPU data paths and runtime, to achieve minimal latency.

Principles

Method

Optimize ASR by compiling encoders for the audio shapes seen in real traffic, moving decoder logic to the GPU with conditional CUDA graphs, reducing CPU copies, using evented I/O, and freezing Python GC state.
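As an illustration of the "reducing CPU copies" step, the sketch below shares an audio buffer through Python's standard `multiprocessing.shared_memory` module. It is an assumption-laden example rather than Together AI's implementation: both sides run in one process for brevity, the buffer size and sample values are hypothetical, and in a real service only the segment name would travel over a channel such as a Unix domain socket.

```python
from multiprocessing import shared_memory

SAMPLES = 16_000  # hypothetical: one second of 16 kHz mono float32 audio

# Producer side: allocate a named segment and write samples in place.
shm = shared_memory.SharedMemory(create=True, size=SAMPLES * 4)
producer_view = shm.buf.cast("f")  # view the raw bytes as float32, no copy
producer_view[0] = 0.25
producer_view[SAMPLES - 1] = -0.5

# Consumer side (normally another process): attach by name.
# Only the short name needs to be sent; the audio bytes are never copied.
peer = shared_memory.SharedMemory(name=shm.name)
consumer_view = peer.buf.cast("f")
first, last = consumer_view[0], consumer_view[SAMPLES - 1]
print(first, last)

# Release the derived views before closing, then remove the segment.
consumer_view.release()
producer_view.release()
peer.close()
shm.close()
shm.unlink()
```

The design point is that the payload stays in one place: the producer writes once, and the consumer reads the same physical pages instead of receiving a serialized copy.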

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.