· Source: Thinking Machines Lab - Connectionism · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Mathematics & Computational Sciences · Depth: Advanced, extended

Summary

Thinking Machines Lab has introduced "Interaction Models," a new class of AI that handles human-AI interaction natively rather than through external scaffolding. Announced on May 11, 2026, these models continuously process audio, video, and text to enable real-time collaboration. The `TML-Interaction-Small` model, a 276B-parameter mixture-of-experts (MoE) model with 12B active parameters, demonstrates qualitatively new interaction capabilities and achieves state-of-the-art combined performance in intelligence and responsiveness. It outperforms existing models on interaction-quality benchmarks such as FD-bench v1.5 (77.8% vs. 46.8% for GPT-realtime-2.0 minimal) and scores 43.4% on Audio MultiChallenge, an intelligence benchmark. The approach uses a multi-stream, micro-turn design that processes input and output in 200 ms chunks, and it pairs the interaction model with an asynchronous background model for deeper reasoning, tool use, and longer-horizon tasks.

Key takeaway

Research scientists developing human-AI collaboration systems should explore building interaction capabilities natively into their models. This approach, exemplified by Thinking Machines Lab's Interaction Models, offers better real-time responsiveness and multimodal collaboration than traditional turn-based or harness-dependent systems. Consider adopting a micro-turn architecture and a split interaction/background model design to achieve both high intelligence and a seamless, continuous user experience, which addresses the "collaboration bottleneck" in current AI interfaces.
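As a rough illustration of the split interaction/background design, the sketch below runs a fast foreground loop while a slower task finishes asynchronously. It is not the Thinking Machines Lab implementation: `background_reasoner`, `interaction_loop`, the sleep-based timings, and the log format are all hypothetical stand-ins, and no real model is called.

```python
import asyncio
from typing import List


async def background_reasoner(query: str) -> str:
    """Hypothetical stand-in for a slower background model (no real inference)."""
    await asyncio.sleep(0.05)  # pretend this is deep reasoning or a tool call
    return f"background result for {query!r}"


async def interaction_loop(num_turns: int, turn_s: float = 0.01) -> List[str]:
    """Keep responding every micro-turn while the background task runs.

    turn_s is scaled down for the demo; the summary describes 200 ms chunks.
    """
    log: List[str] = []
    task = asyncio.create_task(background_reasoner("long task"))
    for step in range(num_turns):
        await asyncio.sleep(turn_s)  # one micro-turn of foreground work
        if task is not None and task.done():
            # Fold the finished background result into the live interaction.
            log.append(f"turn {step}: {task.result()}")
            task = None
        else:
            log.append(f"turn {step}: responsive")
    if task is not None:
        # The background task outlived the loop; wait for it once at the end.
        log.append(f"final: {await task}")
    return log


if __name__ == "__main__":
    for line in asyncio.run(interaction_loop(10)):
        print(line)
```

The point of the sketch is only that the foreground loop never blocks on the background task. It polls `task.done()` each micro-turn and surfaces the result when it is ready.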

Key insights

Native interaction models enable real-time, multimodal human-AI collaboration by integrating interactivity directly into the AI's core architecture.

Method

Interaction models use a multi-stream, micro-turn design, processing 200ms input/output chunks, and employ encoder-free early fusion for multimodal data, co-trained with the transformer.
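A minimal sketch of the micro-turn idea, assuming a mono audio buffer at 16 kHz: the input is cut into 200 ms chunks, and input and output chunks for the same time slice are interleaved into one sequence. The function names, the sample rate, and the `(step, stream, chunk)` tuple layout are illustrative assumptions, not details from the original article.

```python
from typing import Iterator, List, Sequence, Tuple

CHUNK_MS = 200  # micro-turn length stated in the summary


def chunk_samples(
    samples: Sequence[float], sample_rate_hz: int, chunk_ms: int = CHUNK_MS
) -> List[Sequence[float]]:
    """Split a buffer into fixed-duration chunks (the last one may be shorter)."""
    chunk_len = sample_rate_hz * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


def interleave_micro_turns(
    inputs: Sequence[object], outputs: Sequence[object]
) -> Iterator[Tuple[int, str, object]]:
    """Yield (step, stream, chunk) so both streams share one time-aligned sequence."""
    for step, (inp, out) in enumerate(zip(inputs, outputs)):
        yield step, "input", inp
        yield step, "output", out


if __name__ == "__main__":
    one_second = [0.0] * 16000          # assumed 16 kHz sample rate
    chunks = chunk_samples(one_second, 16000)
    print(len(chunks), len(chunks[0]))  # prints "5 3200"
```

Under these assumptions, one second of audio yields five micro-turns of 3,200 samples each, which gives a sense of how often the model would need to read input and emit output.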

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Thinking Machines Lab - Connectionism.