[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD

2026-05-12 · Source: Latent.Space - Www.latent.space · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Advanced, medium

Summary

Thinking Machines has released "Interaction Models: A Scalable Approach to Human-AI Collaboration," featuring TML-Interaction-Small, a 276B-parameter Mixture-of-Experts model with 12B active parameters. The model advances real-time voice interaction beyond turn-based AI by processing images and audio in under 200ms using encoder-free early fusion, similar to Meta's Chameleon. The system demonstrates continuous interactivity through "time-aligned microturns" of 200ms. Thinking Machines also introduced new internal benchmarks, including TimeSpeak, CueSpeak, RepCount-A, and ProactiveVideoQA, to measure time awareness, simultaneous translation, and visual proactivity; on these, the model outperforms existing models such as GPT-Realtime-2 and Gemini 3.1-Flash on various tasks. Separately, OpenAI announced the OpenAI Deployment Company, investing $4B with 19 partners to embed 150 Forward Deployed Engineers, and launched Daybreak for security-specific model distribution and cyber operations.

Key takeaway

For AI Engineers and CTOs building interactive AI systems: prioritize models designed for native, full-duplex multimodal interaction rather than layering capabilities onto turn-based LLMs. Shift your focus to systems that can process audio, video, and text concurrently with sub-200ms latency, so that the exchange between human and AI stays continuous. Evaluate Thinking Machines' approach for real-time interaction and OpenAI's Daybreak for security-focused deployment, and consider local-first agent tooling for greater control and efficiency.

Key insights

Real-time, full-duplex multimodal interaction models are advancing human-AI collaboration beyond turn-based systems.

Principles

Models should be trained from scratch for real-time interaction.
Human-AI bandwidth requires concurrent listening, speaking, watching, thinking, and reacting.
Native interactivity is a deeper innovation than raw benchmark claims.

Method

Thinking Machines uses encoder-free early fusion to process images and audio in under 200ms, focusing on streams of "time-aligned microturns" for continuous interactivity.
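
To make the idea of "time-aligned microturns" concrete, here is a minimal Python sketch that slices an audio stream into 200ms windows and attaches the video frames whose timestamps fall in each window. Only the 200ms figure comes from the summary above; the Microturn class, the microturns function, and the input layout are hypothetical names chosen for illustration, and this is not Thinking Machines' implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence

MICROTURN_MS = 200  # microturn length reported in the summary


@dataclass
class Microturn:
    """Hypothetical container for one time-aligned slice of the input streams."""
    index: int
    start_ms: int
    audio: List[float]    # audio samples that fall inside this window
    frame_ids: List[int]  # indices of video frames timestamped inside this window


def microturns(
    samples: Sequence[float],
    sample_rate_hz: int,
    frame_times_ms: Sequence[int],
) -> Iterator[Microturn]:
    """Yield consecutive 200ms windows over audio, aligned with video frames."""
    samples_per_turn = sample_rate_hz * MICROTURN_MS // 1000
    # Ceiling division so a trailing partial window is still emitted.
    n_turns = -(-len(samples) // samples_per_turn)
    for i in range(n_turns):
        start_ms = i * MICROTURN_MS
        end_ms = start_ms + MICROTURN_MS
        chunk = list(samples[i * samples_per_turn:(i + 1) * samples_per_turn])
        frames = [j for j, t in enumerate(frame_times_ms) if start_ms <= t < end_ms]
        yield Microturn(i, start_ms, chunk, frames)


if __name__ == "__main__":
    # One second of silent 16 kHz audio plus 10 fps video timestamps (assumed inputs).
    audio = [0.0] * 16000
    frame_times = [i * 100 for i in range(10)]
    for turn in microturns(audio, 16000, frame_times):
        print(turn.index, turn.start_ms, len(turn.audio), turn.frame_ids)
```

In a live system each window would be handed to the model as it closes rather than collected up front, which is what lets the model listen and respond concurrently instead of waiting for a full turn to end.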

In practice

Use SGLang to implement real-time multimodal AI stacks.
Explore new benchmarks like TimeSpeak and CueSpeak for evaluating time-aware AI.
Consider DeepSeek V4 Flash for cost-efficient, high-volume agent workloads.

Topics

Real-time Multimodal AI
Thinking Machines Interaction Models
OpenAI Enterprise Solutions
AI Agent Orchestration
Coding Agent Benchmarking

Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space - Www.latent.space.