Thinking Machines shows off preview of near-realtime AI voice and video conversation with new 'interaction models'

Source: VentureBeat · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation) · Depth: Advanced, long

Summary

Thinking Machines, an AI startup founded by former OpenAI CTO Mira Murati and researcher John Schulman, has announced a research preview of "interaction models," a new class of native multimodal systems. These models build interactivity into the architecture itself rather than relying on traditional turn-based exchange. The core innovation is a "full-duplex" architecture that processes 200ms chunks of input and output simultaneously across text, imagery, audio, and video, which enables real-time responses, backchanneling, and proactive interjections. The preview introduces TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts (MoE) model, in a dual system that pairs an Interaction Model for immediate exchanges with a Background Model for asynchronous reasoning. On FD-bench, TML-Interaction-Small is reported to achieve a turn-taking latency of 0.40 seconds and an interaction quality score of 77.8, ahead of competitors such as Gemini-3.1-flash-live and GPT-realtime-2.0.

Key takeaway

For CTOs and VPs of Engineering evaluating next-generation AI integration, Thinking Machines' interaction models mark a departure from current turn-based systems. Their handling of simultaneous multimodal input and output, with a reported turn-taking latency of 0.40 seconds, could change how enterprise applications support natural, real-time human-AI collaboration. Watch the upcoming limited research preview for pilot opportunities, especially in customer service, industrial monitoring, or any scenario where sub-second responsiveness and proactive AI engagement are critical.

Key insights

Thinking Machines' interaction models enable real-time, full-duplex human-AI communication by processing multimodal inputs and outputs simultaneously.

Method

The system uses a multi-stream, micro-turn design that processes 200ms chunks of input and output simultaneously. It employs encoder-free early fusion: audio (as dMel features) and 40x40 image patches pass through a lightweight embedding layer rather than a separate encoder, and all components are co-trained within a transformer.
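
The two input-shaping steps named above, 200ms micro-turns and 40x40 image patches, can be illustrated with a minimal Python sketch. This is not Thinking Machines' implementation: the function names `micro_turns` and `patchify` are hypothetical, the 16 kHz sample rate is an assumption (the summary gives no sample rate), and the dMel conversion and embedding layer are left out entirely.

```python
from typing import Iterator, List, Sequence

CHUNK_MS = 200        # micro-turn length quoted in the summary
PATCH = 40            # image patch side length quoted in the summary
SAMPLE_RATE = 16_000  # assumption: the source does not state a sample rate


def micro_turns(samples: Sequence[float],
                sample_rate: int = SAMPLE_RATE,
                chunk_ms: int = CHUNK_MS) -> Iterator[Sequence[float]]:
    """Yield consecutive chunk_ms-long slices of an audio buffer.

    The final slice is shorter if the buffer length is not a multiple
    of the chunk size.
    """
    step = sample_rate * chunk_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def patchify(image: List[List[int]], patch: int = PATCH) -> List[List[List[int]]]:
    """Split a 2D image (a list of pixel rows) into non-overlapping tiles.

    Tiles are returned in row-major order. Rows or columns left over at
    the edges that do not fill a whole patch are dropped.
    """
    if not image:
        return []
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tiles.append([row[left:left + patch]
                          for row in image[top:top + patch]])
    return tiles


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE
    print(len(list(micro_turns(one_second))))  # number of 200ms chunks

    frame = [[0] * 120 for _ in range(80)]     # 80 rows x 120 columns
    print(len(patchify(frame)))                # number of 40x40 tiles
```

In a full-duplex setting, each 200ms window would carry one input chunk and one output chunk side by side; the sketch covers only how the raw inputs might be sliced before embedding.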

Best for: Machine Learning Engineer, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.