Interaction Models: A Scalable Approach to Human-AI Collaboration
Summary
Thinking Machines announced a research preview of "interaction models" on May 11, 2026, designed to enable real-time, multimodal human-AI collaboration. These models natively handle continuous audio, video, and text input, so they can perceive, think, respond, and act at the same time. Unlike traditional turn-based AI systems, which rely on external components for interactivity, interaction models build these capabilities directly into the architecture so that responsiveness scales with intelligence. The `TML-Interaction-Small` model, a 276B-parameter mixture-of-experts (MoE) model with 12B active parameters, is reported to outperform other systems on interactivity benchmarks such as FD-bench v1.5 and to score strongly on the Audio MultiChallenge intelligence benchmark. Key capabilities include seamless dialog management, verbal and visual interjections, simultaneous speech, time-awareness, and concurrent tool calls, search, and generative UI.
Key takeaway
Research scientists developing advanced AI systems should prioritize building real-time, multimodal interaction directly into model architectures rather than relying on external harnesses. This approach, exemplified by Thinking Machines' interaction models, significantly improves collaboration bandwidth and unlocks capabilities such as simultaneous speech and visual proactivity, which are crucial for more natural and effective human-AI partnerships. Consider exploring continuous stream processing and asynchronous background models to balance responsiveness with deep reasoning.
Key insights
Native, real-time multimodal interaction within AI models enhances human-AI collaboration beyond turn-based systems.
Principles
- Interactivity must scale with intelligence.
- Continuous streams are superior to alternating token sequences.
- Delegate deep reasoning to asynchronous background models.
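The third principle can be illustrated with a minimal Python sketch. It is not Thinking Machines' implementation: the function names, the shared-list context, and the sleep durations are all hypothetical stand-ins. It only shows the general pattern of a fast loop that hands slow work to a background task without waiting for it.

```python
import asyncio

async def background_reasoner(question: str, context: list[str]) -> None:
    """Hypothetical slow model: simulates deep reasoning, then writes to the shared context."""
    await asyncio.sleep(0.05)  # stand-in for long-running reasoning
    context.append(f"background: answer to {question!r}")

async def interaction_loop(chunks: list[str], context: list[str]) -> list[str]:
    """Hypothetical fast loop: handles one input chunk per tick and never blocks on the reasoner."""
    responses: list[str] = []
    tasks: list[asyncio.Task] = []
    for chunk in chunks:
        context.append(f"user: {chunk}")
        if chunk.endswith("?"):
            # Delegate without awaiting, so the loop stays responsive.
            tasks.append(asyncio.create_task(background_reasoner(chunk, context)))
            responses.append("let me think about that")
        else:
            responses.append("mm-hm")
        await asyncio.sleep(0.01)  # stand-in for one real-time tick
    await asyncio.gather(*tasks)  # drain outstanding work before shutting down
    return responses

def run_demo() -> tuple[list[str], list[str]]:
    context: list[str] = []  # shared by both coroutines (an assumption of this sketch)
    chunks = ["hello", "why is the sky blue?", "ok"]
    responses = asyncio.run(interaction_loop(chunks, context))
    return responses, context
```

Calling `run_demo()` returns one immediate response per chunk, while the background answer reaches the shared context only after the slow task finishes. A real system would share model state rather than a Python list.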
Method
The approach pairs a time-aware interaction model, which maintains real-time presence, with an asynchronous background model, which handles sustained reasoning; the two share context. The system processes 200 ms micro-turns using encoder-free early fusion and optimized inference.
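To make the 200 ms micro-turn figure concrete, here is a small sketch of slicing an audio stream into micro-turn-sized chunks. The 200 ms duration comes from the summary above; the 16 kHz sample rate and the helper names are assumptions for illustration only, since the summary does not state the audio format or how the model segments its input.

```python
from typing import Iterator, Sequence

MICRO_TURN_MS = 200  # micro-turn duration stated in the summary

def samples_per_micro_turn(sample_rate_hz: int, micro_turn_ms: int = MICRO_TURN_MS) -> int:
    """Number of audio samples covered by one micro-turn."""
    return sample_rate_hz * micro_turn_ms // 1000

def micro_turns(samples: Sequence[float], sample_rate_hz: int) -> Iterator[Sequence[float]]:
    """Yield consecutive micro-turn-sized slices; the last slice may be shorter."""
    step = samples_per_micro_turn(sample_rate_hz)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# Assumed 16 kHz mono audio: half a second of silence.
half_second = [0.0] * 8000
chunk_lengths = [len(chunk) for chunk in micro_turns(half_second, 16_000)]
```

Under the assumed sample rate, each micro-turn covers 3,200 samples, so half a second of audio yields two full chunks and one partial chunk of 1,600 samples.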
In practice
- Implement streaming sessions for low-latency inference.
- Co-train all multimodal components from scratch.
- Use automated red-teaming for long-horizon safety.
Topics
- Interaction Models
- Human-AI Collaboration
- Real-time Multimodal AI
- Micro-turn Architecture
- AI Performance Benchmarks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Thinking Machines Lab - Connectionism.