Microsoft's new MAI models

2026-06-02 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

Microsoft announced two new large language models on June 2, 2026: MAI-Thinking-1 and MAI-Code-1-Flash. MAI-Thinking-1 is a reasoning model with 1 trillion total parameters, 35 billion of them active. It is available to select early partners, and Microsoft claims it was preferred over Sonnet 4.6 in blind human evaluations. MAI-Code-1-Flash, with 137 billion total parameters and 5 billion active, is optimized for GitHub Copilot and VS Code and is rolling out to individual users. Microsoft initially stated that both models were trained on "clean and commercially licensed data." Subsequent details from the MAI-Thinking-1 technical paper, however, show that the training corpus includes a proprietary web crawl (794 billion pages after filtering) and 24.2 billion pages from Common Crawl. The crawl data is filtered to remove adult and piracy content as well as AI-generated content.

Key takeaway

AI ethicists and legal professionals evaluating new LLMs should scrutinize claims of "clean" or "commercially licensed" training data. Despite Microsoft's initial claims, the MAI models rely on extensive web crawls, which underscores the need for detailed data provenance disclosures. Make sure your organization's LLM adoption strategy accounts for the actual data sources and their legal and ethical implications, especially for content that is AI-generated or scraped from the public web.

Key insights

Microsoft's new MAI models pair mixture-of-experts (MoE) architectures with filtered web-crawl data, showing that training-data licensing remains complicated.

Principles

MoE models balance scale with active parameter efficiency.
"Licensed data" claims can mask web-crawled origins.
Filtering AI-generated content improves training corpus quality.
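
The first principle can be made concrete with a little arithmetic. The sketch below uses only the parameter counts reported in the summary; the helper name active_fraction is hypothetical, and the ratio is a rough indicator of per-token compute, not a benchmark of speed or quality.

```python
# Parameter counts as reported in the summary (total vs. active).
MODELS = {
    "MAI-Thinking-1": {"total": 1_000_000_000_000, "active": 35_000_000_000},
    "MAI-Code-1-Flash": {"total": 137_000_000_000, "active": 5_000_000_000},
}


def active_fraction(total: int, active: int) -> float:
    """Return the share of a model's parameters that are active.

    Hypothetical helper for illustration; not part of any Microsoft tooling.
    """
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active / total


for name, params in MODELS.items():
    share = active_fraction(params["total"], params["active"])
    print(f"{name}: {share:.2%} of parameters active")
# MAI-Thinking-1: 3.50% of parameters active
# MAI-Code-1-Flash: 3.65% of parameters active
```

Both models activate roughly 3.5% of their weights, so they sit at a similar point on the scale-versus-efficiency trade-off despite the large difference in total size.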

Method

The training corpus combines a proprietary web crawl with pages from Common Crawl. UT1 block lists and AI-content detection filter the proprietary crawl from 1.2 trillion pages down to 794 billion.
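
To put the filtering figures in proportion, here is a rough calculation based on the page counts quoted in this summary. It assumes that the 1.2 trillion figure is the proprietary crawl before filtering and that the Common Crawl pages are counted separately from it; neither assumption is confirmed here, and the inputs are rounded, so treat the outputs as approximate.

```python
# Page counts quoted in the summary (rounded figures).
RAW_PAGES = 1_200_000_000_000        # assumed: proprietary crawl before filtering
FILTERED_PAGES = 794_000_000_000     # proprietary crawl after filtering
COMMON_CRAWL_PAGES = 24_200_000_000  # assumed: counted separately from the crawl

removed = RAW_PAGES - FILTERED_PAGES
retained = FILTERED_PAGES / RAW_PAGES
cc_share = COMMON_CRAWL_PAGES / (FILTERED_PAGES + COMMON_CRAWL_PAGES)

print(f"Pages removed by filtering: {removed / 1e9:.0f} billion")
print(f"Share of crawl retained:    {retained:.1%}")
print(f"Common Crawl share of mix:  {cc_share:.1%}")
# Pages removed by filtering: 406 billion
# Share of crawl retained:    66.2%
# Common Crawl share of mix:  3.0%
```

On these assumptions, filtering discards about a third of the crawl, and Common Crawl is a small slice of the page count next to the proprietary crawl.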

In practice

Assess LLM efficiency via active parameter counts.
Verify data provenance beyond marketing claims.
Implement AI-content detection in data pipelines.
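
As a minimal sketch of the last two points, the filter below drops pages whose host appears on a domain block list and pages that a detector scores as likely AI-generated. It is not Microsoft's pipeline: the Page type, the function names, the threshold, and especially toy_ai_score are hypothetical stand-ins, and a real pipeline would load an actual block list and call a trained detector.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Page:
    url: str
    text: str


def host_of(url: str) -> str:
    """Lower-cased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_blocked(host: str, blocked: frozenset[str]) -> bool:
    """True if the host or any parent domain is on the block list."""
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))


def toy_ai_score(text: str) -> float:
    """Placeholder detector (assumption): flags one telltale phrase only."""
    return 1.0 if "as an ai language model" in text.lower() else 0.0


def filter_pages(
    pages: Iterable[Page],
    blocked: frozenset[str],
    ai_score: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off
) -> Iterator[Page]:
    """Yield pages that pass both the block-list and the AI-content check."""
    for page in pages:
        if is_blocked(host_of(page.url), blocked):
            continue
        if ai_score(page.text) >= threshold:
            continue
        yield page
```

Usage, with made-up domains:

```python
pages = [
    Page("https://example.com/a", "A hand-written article."),
    Page("https://cdn.blocked.example/x", "Mirrored content."),
    Page("https://example.org/b", "As an AI language model, I cannot..."),
]
kept = list(filter_pages(pages, frozenset({"blocked.example"}), toy_ai_score))
print([p.url for p in kept])  # ['https://example.com/a']
```

Logging which rule rejected each page, rather than silently dropping it, is what makes the resulting corpus auditable when provenance questions come up later.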

Topics

Microsoft MAI Models
Large Language Models
Mixture-of-Experts
Training Data Licensing
Web Crawling
AI Content Detection

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Ethicist, Legal Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.