Interaction Models: A Scalable Approach to Human-AI Collaboration

2026-05-11 · Source: Thinking Machines Lab - Connectionism · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, long

Summary

On May 11, 2026, Thinking Machines announced a research preview of "interaction models," which are designed to enable real-time, multimodal human-AI collaboration. These models natively handle continuous audio, video, and text input, so perception, thought, response, and action can happen simultaneously. Traditional turn-based AI systems rely on external components for interactivity; interaction models build these capabilities into the architecture itself, so that responsiveness scales with intelligence. The `TML-Interaction-Small` model, a 276B-parameter mixture-of-experts (MoE) model with 12B active parameters, reportedly posts leading results on interactivity benchmarks such as FD-bench v1.5 and strong results on Audio MultiChallenge, an intelligence benchmark. Key capabilities include seamless dialog management, verbal and visual interjections, simultaneous speech, time-awareness, and concurrent tool calls, search, and generative UI.

Key takeaway

Research scientists developing advanced AI systems should prioritize building real-time, multimodal interaction directly into model architectures rather than relying on external harnesses. According to Thinking Machines, this approach raises collaboration bandwidth and unlocks capabilities such as simultaneous speech and visual proactivity, which matter for more natural and effective human-AI partnerships. Consider exploring continuous stream processing and asynchronous background models to balance responsiveness with deep reasoning.

Key insights

Native, real-time multimodal interaction within AI models enhances human-AI collaboration beyond turn-based systems.

Principles

Interactivity must scale with intelligence.
Continuous streams are superior to alternating token sequences.
Delegate deep reasoning to asynchronous background models.

Method

The approach pairs a time-aware interaction model, which maintains real-time presence, with an asynchronous background model that handles sustained reasoning; the two share context. Input is processed in 200 ms micro-turns using encoder-free early fusion and optimized inference.
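The division of labor described above can be sketched as two cooperating coroutines. This is a minimal illustration under stated assumptions, not Thinking Machines' implementation: `SharedContext`, `interaction_loop`, `background_reasoner`, and the rule that a chunk ending in "?" triggers delegation are all hypothetical. Only the 200 ms micro-turn length comes from the summary.

```python
import asyncio
from dataclasses import dataclass, field

MICRO_TURN_S = 0.2  # 200 ms micro-turn length, taken from the summary


@dataclass
class SharedContext:
    """Hypothetical event log that both models read and write."""
    events: list = field(default_factory=list)


async def background_reasoner(ctx: SharedContext, question: str, delay_s: float) -> None:
    """Stand-in for the background model: slow, but never blocks the loop."""
    await asyncio.sleep(delay_s)  # simulated long-running reasoning
    ctx.events.append(("background", f"result for {question}"))


async def interaction_loop(ctx: SharedContext, chunks: list,
                           tick_s: float = MICRO_TURN_S,
                           reasoning_delay_s: float = 0.5) -> None:
    """Stand-in for the interaction model: one cheap step per micro-turn."""
    pending = []
    for chunk in chunks:
        ctx.events.append(("heard", chunk))
        if chunk.endswith("?"):
            # Assumed trigger: delegate instead of stalling the real-time loop.
            pending.append(asyncio.create_task(
                background_reasoner(ctx, chunk, reasoning_delay_s)))
        await asyncio.sleep(tick_s)  # wait for the next micro-turn
    await asyncio.gather(*pending)  # drain outstanding background work


def run_demo(tick_s: float = 0.01, reasoning_delay_s: float = 0.025) -> list:
    """Run a short scripted session with shortened timings."""
    ctx = SharedContext()
    chunks = ["hello", "what is 17 * 23?", "also", "keep listening"]
    asyncio.run(interaction_loop(ctx, chunks, tick_s, reasoning_delay_s))
    return ctx.events
```

Calling `run_demo()` returns the shared event log. The loop keeps recording input on every tick while the delegated question is still being worked on, and the background result lands in the same log once it is ready.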

In practice

Implement streaming sessions for low-latency inference.
Co-train all multimodal components from scratch.
Use automated red-teaming for long-horizon safety.
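One way to see why streaming sessions help latency is to count how many tokens must be processed per micro-turn. The sketch below is a toy cost model, not a real inference API: `StreamingSession`, `stateless_cost`, and `streaming_cost` are hypothetical names, and it assumes a session can reuse all previously processed state so that only new tokens incur work.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingSession:
    """Hypothetical session that retains prefix state between micro-turns."""
    cached_tokens: list = field(default_factory=list)
    tokens_processed: int = 0

    def append(self, new_tokens: list) -> None:
        # Only the new tokens are processed; the cached prefix is reused.
        self.tokens_processed += len(new_tokens)
        self.cached_tokens.extend(new_tokens)


def stateless_cost(chunks: list) -> int:
    """Tokens processed if every request re-sends the full history."""
    total, history = 0, 0
    for chunk in chunks:
        history += len(chunk)
        total += history
    return total


def streaming_cost(chunks: list) -> int:
    """Tokens processed when one session persists across all chunks."""
    session = StreamingSession()
    for chunk in chunks:
        session.append(chunk)
    return session.tokens_processed
```

For five micro-turns of ten tokens each, the stateless pattern processes 10 + 20 + 30 + 40 + 50 = 150 tokens, while the session processes 50. The gap grows quadratically with session length, which is the motivation for keeping state server-side in a long-running interaction.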

Topics

Interaction Models
Human-AI Collaboration
Real-time Multimodal AI
Micro-turn Architecture
AI Performance Benchmarks

Code references

sgl-project/sglang

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Thinking Machines Lab - Connectionism.