Deepgram speech-to-text and voice models now available natively on Together AI

2026-04-22 · Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Intermediate, medium

Summary

Deepgram's production speech-to-text (STT) and text-to-speech (TTS) models, including Nova-3, Nova-3 Multilingual, Flux, and Aura-2, are now natively available on Together AI's Dedicated Model Inference platform as of April 2, 2026. This integration allows teams to deploy a complete real-time voice agent pipeline, combining Deepgram's transcription and synthesis with any Large Language Model (LLM) from Together AI's catalog on a single production surface. Key models include Flux, designed for conversational STT with 250ms end-of-turn detection to manage interruptions and turn-taking, and Nova-3, which offers production transcription for complex real-world audio with vocabulary customization. Aura-2 provides enterprise-grade TTS for clear and consistent voice agents. The platform offers dedicated GPU capacity, a 9% uptime SLA, SOC 2 Type II, HIPAA-ready support, and data residency options, streamlining operations for use cases like contact centers, healthcare, and financial services.

Key takeaway

For MLOps Engineers building real-time voice agents, integrating Deepgram's STT and TTS models on Together AI simplifies your production stack. You can now run transcription, LLM reasoning, and synthesis on a single platform, significantly reducing latency and operational fragility often caused by multi-vendor setups. This unified approach, with features like 250ms end-of-turn detection and vocabulary customization, helps you deliver more natural and reliable conversational experiences, especially in regulated environments requiring SOC 2 Type II or HIPAA compliance.

Key insights

Real-time voice agents require integrated STT, LLM, and TTS on a single platform to minimize latency and operational complexity.

Principles

Conversational STT needs turn detection, not just transcription.
Production audio demands robust models for noise and accents.
Enterprise TTS requires clarity for structured information.

Method

Deploy Deepgram STT/TTS models (Flux, Nova-3, Aura-2) alongside LLMs on Together AI's Dedicated Model Inference for a unified voice pipeline.

In practice

Use Flux for conversational turn-taking in voice agents.
Customize Nova-3 vocabulary for domain-specific terms.
Leverage Aura-2 for consistent, clear patient-facing output.

Topics

Speech-to-Text
Text-to-Speech
Real-time Voice Agents
Together AI
Deepgram
Conversational AI
MLOps Infrastructure

Best for: Machine Learning Engineer, CTO, VP of Engineering/Data, AI Engineer, MLOps Engineer, NLP Engineer

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.