Mira Murati’s Thinking Machines Lab Introduces Interaction Models: A Native Multimodal Architecture for Real-Time Human-AI Collaboration
Summary
Mira Murati's Thinking Machines Lab has unveiled a research preview of TML-Interaction-Small, a 276B-parameter mixture-of-experts (MoE) model with 12B active parameters, designed for real-time human-AI collaboration. The model uses a native multimodal architecture that processes 200ms chunks of audio, video, and text simultaneously through a multi-stream, time-aligned micro-turn system. Unlike conventional real-time AI, which often relies on bolted-on voice-activity detection, TML-Interaction-Small builds interactivity directly into its weights and needs no external turn-detection scaffolding. It supports full-duplex interaction and asynchronous background reasoning that shares the full conversation context. The model scored 77.8 on FD-bench v1.5, compared with 47.8 for GPT-realtime-2.0, and reached 32.4 mIoU on Charades for visual proactivity, compared with 0 for GPT-realtime-2.0.
Key takeaway
AI engineers building real-time conversational AI should investigate architectures that co-train multimodal inputs and embed interaction directly in the model's weights. This approach, exemplified by TML-Interaction-Small, outperformed systems that rely on external turn detection in both full-duplex interaction and visual proactivity, and it could enable more natural and efficient human-AI collaboration.
Key insights
Native multimodal architectures can enable true real-time human-AI collaboration by integrating interactivity into model weights.
Principles
- Co-train modalities from scratch.
- Integrate interactivity into model weights.
Method
The TML-Interaction-Small model uses a multi-stream, time-aligned micro-turn architecture to process 200ms chunks of audio (dMel), video (40x40 hMLP patches), and text simultaneously.
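To make the idea of time-aligned micro-turns concrete, here is a minimal Python sketch that groups timestamped events from three streams into 200ms windows. Only the 200ms chunk length comes from the summary above. The `MicroTurn` class, the `align_streams` function, and the event tuple layout are hypothetical names chosen for illustration, and the sketch does not model dMel audio features, hMLP video patches, or anything about the model's real input pipeline.

```python
from dataclasses import dataclass, field

CHUNK_MS = 200  # chunk length stated in the summary


@dataclass
class MicroTurn:
    """One 200ms window holding whatever arrived on each stream (hypothetical type)."""
    index: int
    audio: list = field(default_factory=list)
    video: list = field(default_factory=list)
    text: list = field(default_factory=list)


def align_streams(events, chunk_ms=CHUNK_MS):
    """Group (timestamp_ms, modality, payload) events into time-aligned micro-turns.

    Every window up to the last event is returned, including empty ones,
    so silence is represented explicitly rather than skipped.
    """
    turns = {}
    for timestamp_ms, modality, payload in events:
        if modality not in ("audio", "video", "text"):
            raise ValueError(f"unknown modality: {modality}")
        idx = timestamp_ms // chunk_ms  # which 200ms window this event falls in
        turn = turns.setdefault(idx, MicroTurn(index=idx))
        getattr(turn, modality).append(payload)
    if not turns:
        return []
    return [turns.get(i, MicroTurn(index=i)) for i in range(max(turns) + 1)]


events = [
    (0, "audio", "a0"),
    (50, "video", "v0"),
    (210, "audio", "a1"),
    (650, "text", "hi"),
]
turns = align_streams(events)
for t in turns:
    print(t.index, t.audio, t.video, t.text)
```

At this chunk length a stream yields five micro-turns per second, and the window with index 2 in the example stays empty because nothing arrived between 400 ms and 600 ms.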
In practice
- Implement full-duplex interaction.
- Run asynchronous background reasoning.
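The two practices above can be illustrated with a small `asyncio` sketch: a loop that keeps consuming input and emitting output every tick while a slower reasoning task runs in the background and writes into the same shared context. This is an application-level analogy only, under the assumption that a concurrency pattern is a useful mental model. The summary says TML-Interaction-Small handles interactivity inside its weights, and all names here (`background_reasoning`, `interaction_loop`, the `"ack"` placeholder output) are hypothetical.

```python
import asyncio


async def background_reasoning(context, question):
    """Stand-in for slow reasoning that shares the conversation context."""
    await asyncio.sleep(0.05)  # simulated thinking time
    result = f"answer to {question!r}"
    context.append(("reasoning", result))  # result lands in the shared context
    return result


async def interaction_loop(incoming, n_turns=5, tick_s=0.01):
    """Consume one input and emit one output per tick, without blocking on reasoning."""
    context, outputs, task = [], [], None
    for turn in range(n_turns):
        # Input side: keep listening on every tick.
        if turn < len(incoming):
            context.append(("user", incoming[turn]))
            if task is None and incoming[turn].endswith("?"):
                task = asyncio.create_task(
                    background_reasoning(context, incoming[turn])
                )
        # Output side: keep responding on every tick.
        if task is not None and task.done():
            outputs.append((turn, task.result()))
            task = None
        else:
            outputs.append((turn, "ack"))
        await asyncio.sleep(tick_s)
    if task is not None:  # flush a result that finished after the last tick
        outputs.append((n_turns, await task))
    return context, outputs


context, outputs = asyncio.run(
    interaction_loop(["hello", "what is 2+2?", "ok"])
)
for item in outputs:
    print(item)
```

The loop never waits on the reasoning task while turns remain, so the tick that starts the task still produces an immediate output. Exactly which tick delivers the answer depends on scheduler timing.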
Topics
- Interaction Models
- TML-Interaction-Small
- Multimodal AI
- Real-time AI
- Full-duplex Interaction
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.