Microsoft's new MAI models
Summary
Microsoft announced two new large language models on June 2nd, 2026: MAI-Thinking-1 and MAI-Code-1-Flash. MAI-Thinking-1, a 1-trillion-parameter model with 35 billion active parameters, is designed for reasoning and is available to select early partners; Microsoft claims it was preferred over Sonnet 4.6 in blind human evaluations. MAI-Code-1-Flash, with 137 billion parameters (5 billion active), is optimized for GitHub Copilot and VS Code and is rolling out to individual users. Microsoft initially stated that both models were trained on "clean and commercially licensed data." However, the MAI-Thinking-1 technical paper revealed that the training corpus includes a proprietary web crawl (794 billion pages after filtering) and 24.2 billion pages from Common Crawl. The data is filtered to remove adult and piracy content as well as AI-generated content.
Key takeaway
AI ethicists and legal professionals evaluating new LLMs should scrutinize claims of "commercially licensed" or "clean" training data. Microsoft's MAI models, despite the initial claims, rely on extensive web crawls, which underscores the need for detailed data provenance disclosures. Make sure your organization's LLM adoption strategy accounts for the actual data sources and their potential legal or ethical implications, especially for content generated by AI or scraped from the public web.
Key insights
Microsoft's new MAI models use mixture-of-experts (MoE) architectures and filtered web data, which highlights how complicated training-data licensing remains.
Principles
- MoE models balance scale with active parameter efficiency.
- "Licensed data" claims often mask web-crawled origins.
- Filtering AI-generated content improves training corpus quality.
Method
The training corpus combines a proprietary web crawl with Common Crawl. UT1 block lists and AI-content detection reduce the proprietary crawl from 1.2 trillion pages to 794 billion, removing roughly a third of the pages.
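A two-stage filter of this kind might look like the following minimal sketch. It is not Microsoft's pipeline: the blocked domains are placeholders standing in for UT1 category lists, `toy_ai_score` is a hypothetical stand-in for a real AI-content detector, and the 0.5 threshold is an arbitrary assumption.

```python
from typing import Callable, Iterable, Iterator, Tuple
from urllib.parse import urlparse

Page = Tuple[str, str]  # (url, text)

# Placeholder entries; a real list would be loaded from UT1 category files.
BLOCKED_DOMAINS = frozenset({"adult.example", "piracy.example"})


def is_blocked(url: str, blocked: frozenset = BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host, or any parent domain, is on the list."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))


def toy_ai_score(text: str) -> float:
    """Hypothetical detector: flags one telltale phrase. Not a real classifier."""
    return 1.0 if "as an ai language model" in text.lower() else 0.0


def filter_pages(
    pages: Iterable[Page],
    ai_score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> Iterator[Page]:
    """Yield pages that pass the block list and score below the threshold."""
    for url, text in pages:
        if is_blocked(url):
            continue  # stage 1: domain block list
        if ai_score(text) >= threshold:
            continue  # stage 2: likely AI-generated content
        yield url, text


# Retention implied by the reported page counts.
RETENTION = 794e9 / 1.2e12
print(f"Pages retained after filtering: {RETENTION:.1%}")
```

Running the block list first is a deliberate ordering in this sketch: a domain lookup is cheap, so it avoids calling the comparatively expensive detector on pages that would be dropped anyway.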
In practice
- Assess LLM efficiency via active parameter counts.
- Verify data provenance beyond marketing claims.
- Implement AI-content detection in data pipelines.
Topics
- Microsoft MAI Models
- Large Language Models
- Mixture-of-Experts
- Training Data Licensing
- Web Crawling
- AI Content Detection
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Ethicist, Legal Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.