Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"
Summary
Microsoft had said its new MAI models were trained exclusively on "enterprise grade, clean and commercially licensed data," but the models were partly developed using unlicensed web data, including Common Crawl. A technical paper from Microsoft confirms this data pipeline, which also incorporates a proprietary crawler that respects the Robots Exclusion Protocol (robots.txt) and related meta-tag controls. This approach places the onus of content protection on site owners, mirroring practices of other AI companies that rely on contested "fair use" arguments for scraping copyrighted material. Whether training AI on vast troves of copyrighted works is lawful remains unresolved in the courts, and the paper exposes a discrepancy between Microsoft's public claims and its actual data sourcing.
Key takeaway
Legal professionals advising on AI development and content creators concerned about intellectual property should take this as a reminder that the law around AI training data remains ambiguous. Scrutinize vendor claims regarding data licensing, and understand that reliance on "fair use" for web-scraped content remains highly contested in courts. Use robots.txt and meta-tag controls to manage how crawlers access your site's content, and prepare for potential litigation as copyright law evolves.
Key insights
Microsoft's MAI models used unlicensed web data, undercutting its "clean data" claims while fair use disputes remain unresolved in court.
Principles
- AI training often relies on contested fair use claims.
- Content protection burden shifts to site owners via robots.txt.
Method
Microsoft employs a proprietary web crawler that adheres to the Robots Exclusion Protocol (robots.txt) and related meta-tag/HTML controls when accessing and using web content.
In practice
- Site owners can manage content access via robots.txt.
- Evaluate AI vendor data sourcing claims critically.
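As a rough illustration of the robots.txt point above, the sketch below uses Python's standard `urllib.robotparser` to check what a set of rules would permit for a given crawler. The rules, the `example.com` URLs, and the `ExampleBot` name are hypothetical. `CCBot` is the user-agent token Common Crawl publishes for its crawler; the source does not name Microsoft's crawler, so none is assumed here. Meta-tag controls are not covered, and robots.txt only restrains crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block Common Crawl's CCBot
# from the whole site, and keep /drafts/ off-limits to every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name that falls under the "*" rules.
    for agent in ("CCBot", "ExampleBot"):
        for path in ("/articles/1", "/drafts/2"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Running a check like this against a site's real robots.txt is one way to confirm that the rules block the crawlers the owner intends to block before relying on them.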
Topics
- AI Training Data
- Copyright Law
- Fair Use
- Web Scraping
- MAI Models
- Robots Exclusion Protocol
- Data Licensing
Best for: CTO, Executive, Investor, AI Ethicist, Legal Professional, Tech Journalist
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.