Microsoft's new MAI models

· Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

Microsoft announced two new large language models on June 2nd, 2026: MAI-Thinking-1 and MAI-Code-1-Flash. MAI-Thinking-1 is a reasoning model with 1 trillion total parameters, 35 billion of them active, and is available to select early partners; Microsoft claims it was preferred over Sonnet 4.6 in blind human evaluations. MAI-Code-1-Flash has 137 billion parameters, 5 billion of them active, is optimized for GitHub Copilot and VS Code, and is rolling out to individual users. Microsoft initially stated that both models were trained on "clean and commercially licensed data." The MAI-Thinking-1 technical paper, however, revealed that the training corpus includes a proprietary web crawl (794 billion pages after filtering) and 24.2 billion pages from Common Crawl. The data is filtered to remove adult and piracy content as well as AI-generated content.

Key takeaway

AI ethicists and legal professionals evaluating new LLMs should scrutinize claims of "commercially licensed" or "clean" training data. Despite Microsoft's initial claims, the MAI models rely on extensive web crawls, which underscores the need for detailed data provenance disclosures. Make sure your organization's LLM adoption strategy accounts for the actual data sources and their potential legal or ethical implications, especially for content generated by AI or scraped from the public web.

Key insights

Microsoft's new MAI models use mixture-of-experts (MoE) architectures and filtered web data, which highlights how complicated training-data licensing remains.
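
To make the MoE figures concrete, the short calculation below works out what share of each model's parameters is active per token. It uses only the parameter counts quoted in the summary; the function name `active_fraction` is illustrative and does not come from Microsoft's materials.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model's parameters used per token."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


# (total, active) parameter counts as quoted in the summary.
MODELS = {
    "MAI-Thinking-1": (1e12, 35e9),
    "MAI-Code-1-Flash": (137e9, 5e9),
}

for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.2%} of parameters active")
```

Both models activate only a few percent of their weights per token (about 3.5% and 3.6% respectively), which is why the total and active counts are reported separately.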

Method

The training data combines a proprietary web crawl with Common Crawl. UT1 block lists and AI-content detection filter 1.2 trillion pages down to 794 billion.
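
The summary does not describe how the filtering is implemented, so the following is only a minimal sketch of host-based block-list filtering, the general technique that domain lists such as UT1 support. The domain set, the sample URLs, and the function names are all hypothetical, and the AI-content detection step is omitted. The last lines compute the retention rate implied by the page counts quoted above.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for domains loaded from a block list.
BLOCKED_DOMAINS = {"blocked.example", "piracy.example"}


def is_blocked(url: str, blocked: set[str]) -> bool:
    """True if the URL's host, or any parent domain of it, is on the block list."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # For "a.b.c" this checks "a.b.c", then "b.c", then "c".
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))


def filter_pages(urls, blocked):
    """Keep only the URLs whose host is not covered by the block list."""
    return [u for u in urls if not is_blocked(u, blocked)]


pages = [
    "https://news.example/story",
    "https://cdn.blocked.example/video",
    "http://piracy.example/download",
]
kept = filter_pages(pages, BLOCKED_DOMAINS)
print(kept)

# Retention implied by the quoted figures: 794 billion of 1.2 trillion pages.
retention = 794e9 / 1.2e12
print(f"{retention:.1%} of pages kept")
```

On the quoted figures, roughly two thirds of pages survive filtering, so about a third of the crawl is discarded.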

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Ethicist, Legal Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.