[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD
Summary
Thinking Machines has released "Interaction Models: A Scalable Approach to Human-AI Collaboration," featuring TML-Interaction-Small, a 276B-parameter Mixture-of-Experts model with 12B active parameters. The model moves real-time voice interaction beyond turn-based AI: it processes images and audio in under 200ms using encoder-free early fusion, similar to Meta's Chameleon, and sustains continuous interactivity through "time-aligned microturns" of 200ms. Thinking Machines also introduced new internal benchmarks, including TimeSpeak, CueSpeak, RepCount-A, and ProactiveVideoQA, to measure time awareness, simultaneous translation, and visual proactivity, and reports outperforming existing models such as GPT-Realtime-2 and Gemini 3.1-Flash on various tasks.
Separately, OpenAI announced the OpenAI Deployment Company, investing $4B with 19 partners to embed 150 Forward Deployed Engineers, and launched Daybreak for security-specific model distribution and cyber operations.
Key takeaway
For AI Engineers and CTOs building interactive AI systems: prioritize models designed for native, full-duplex multimodal interaction rather than layering capabilities onto turn-based LLMs. Shift your focus to systems that process audio, video, and text concurrently with sub-200ms latency, so that the human-AI exchange stays continuous instead of waiting on turns. Evaluate Thinking Machines' approach for real-time interaction and OpenAI's Daybreak where security-specific deployment matters, and consider local-first agent tooling for greater control and efficiency.
Key insights
Real-time, full-duplex multimodal interaction models are advancing human-AI collaboration beyond turn-based systems.
Principles
- Models should be trained from scratch for real-time interaction.
- Human-AI bandwidth requires concurrent listening, speaking, watching, thinking, and reacting.
- Native interactivity is a deeper innovation than raw benchmark claims.
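To make the "concurrent listening and speaking" principle concrete, here is a minimal application-level sketch using Python's `asyncio`. It is only an illustration of full-duplex scheduling, not Thinking Machines' method: the article describes a model trained natively for this, whereas the function names `listen`, `speak`, and `full_duplex` and the string chunks are hypothetical stand-ins.

```python
import asyncio
from typing import List, Tuple

Log = List[Tuple[str, str]]


async def listen(incoming: List[str], log: Log) -> None:
    """Record user chunks as they arrive, without waiting for the speaker."""
    for chunk in incoming:
        await asyncio.sleep(0)  # yield control so the speaker can run too
        log.append(("heard", chunk))


async def speak(outgoing: List[str], log: Log) -> None:
    """Emit assistant chunks, without waiting for the listener to finish."""
    for chunk in outgoing:
        await asyncio.sleep(0)  # yield control so the listener can run too
        log.append(("said", chunk))


async def full_duplex(incoming: List[str], outgoing: List[str]) -> Log:
    """Run both directions at once; a turn-based loop would run them in sequence."""
    log: Log = []
    await asyncio.gather(listen(incoming, log), speak(outgoing, log))
    return log


if __name__ == "__main__":
    for kind, chunk in asyncio.run(full_duplex(["u1", "u2"], ["a1", "a2"])):
        print(kind, chunk)
```

The log interleaves "heard" and "said" entries instead of grouping them, which is the property a turn-based pipeline lacks.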
Method
Thinking Machines uses encoder-free early fusion to process images and audio in under 200ms, and treats the interaction as a stream of 200ms "time-aligned microturns" rather than discrete turns, which is what keeps the interactivity continuous.
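One way to picture a time-aligned microturn is as a fixed 200ms window that collects whatever audio samples and video frames fall inside it. The sketch below is a guess at that windowing step only, under stated assumptions: the 200ms figure comes from the article, while the `Microturn` class, the `to_microturns` function, the sample rate, and the frame timestamps are all hypothetical. It says nothing about how the model itself fuses the modalities.

```python
from dataclasses import dataclass
from typing import List, Sequence

MICROTURN_MS = 200  # window length from the article; everything else is assumed


@dataclass
class Microturn:
    index: int
    start_ms: int
    end_ms: int
    audio: List[float]    # audio samples inside [start_ms, end_ms)
    frame_ids: List[str]  # video frames whose timestamp is inside the window


def to_microturns(
    samples: Sequence[float],
    sample_rate_hz: int,
    frame_times_ms: Sequence[int],
    frame_ids: Sequence[str],
    window_ms: int = MICROTURN_MS,
) -> List[Microturn]:
    """Slice audio into fixed windows and attach the frames that share each window."""
    samples_per_window = sample_rate_hz * window_ms // 1000
    if samples_per_window <= 0:
        raise ValueError("window is shorter than one audio sample")

    # Ceiling division, so a trailing partial window is kept rather than dropped.
    n_windows = -(-len(samples) // samples_per_window)

    turns: List[Microturn] = []
    for i in range(n_windows):
        start_ms = i * window_ms
        end_ms = start_ms + window_ms
        audio = list(samples[i * samples_per_window:(i + 1) * samples_per_window])
        frames = [
            fid for t, fid in zip(frame_times_ms, frame_ids)
            if start_ms <= t < end_ms
        ]
        turns.append(Microturn(i, start_ms, end_ms, audio, frames))
    return turns


if __name__ == "__main__":
    # Half a second of silence at 16 kHz plus four frame timestamps (made-up data).
    demo = to_microturns([0.0] * 8000, 16000, [0, 100, 250, 450], ["f0", "f1", "f2", "f3"])
    for turn in demo:
        print(turn.index, turn.start_ms, turn.end_ms, len(turn.audio), turn.frame_ids)
```

Frames timestamped after the last audio window are ignored in this sketch; a real pipeline would need a policy for modalities that run ahead of or behind each other.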
In practice
- Use SGLang to implement real-time multimodal AI stacks.
- Explore new benchmarks like TimeSpeak and CueSpeak for evaluating time-aware AI.
- Consider DeepSeek V4 Flash for cost-efficient, high-volume agent workloads.
Topics
- Real-time Multimodal AI
- Thinking Machines Interaction Models
- OpenAI Enterprise Solutions
- AI Agent Orchestration
- Coding Agent Benchmarking
Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space (www.latent.space).