Microsoft deletes blog telling users to train AI on pirated Harry Potter books

· Source: AI - Ars Technica · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Ethics & Policy · Depth: Novice, quick

Summary

Microsoft recently deleted a blog post that instructed users on how to train large language models (LLMs) using pirated Harry Potter books to create Q&A systems and generate fan fiction. The blog, authored by a Microsoft employee, demonstrated building a Q&A system capable of retrieving specific book excerpts for queries like "Wizarding World snacks." It also explored generating new stories by combining existing narratives with retrieved passages. One example involved creating fan fiction where Harry Potter learns about Microsoft's Native Vector Support in SQL from a new friend on the Hogwarts Express, complete with a Microsoft-branded image. This approach raised concerns among rights holders regarding potential copyright infringement based on the model's outputs.

Key takeaway

AI/ML development teams considering training models on third-party content should make legal review of data sources a priority. Ensure all training data is properly licensed or falls under fair use to avoid copyright infringement risks, especially when generating derivative works such as fan fiction or Q&A responses that closely mirror the original texts. Have your legal team vet content acquisition strategies before model deployment.
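One way a team might enforce this advice in a data pipeline is a simple gate that only admits sources with a recorded license and a recorded legal sign-off. The sketch below is illustrative only: `SourceRecord`, `APPROVED_LICENSES`, `split_by_clearance`, and the file names are hypothetical, and the allow-list is a placeholder for whatever a real legal review produces. The code does not decide whether a use is fair use; it only checks that a human decision was recorded.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allow-list; in practice this would come out of legal review.
APPROVED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "public-domain", "licensed-by-contract"}


@dataclass(frozen=True)
class SourceRecord:
    """Provenance metadata for one candidate training source (hypothetical schema)."""
    name: str
    license: Optional[str]   # None means the provenance is unknown
    reviewed_by_legal: bool  # True once a reviewer has signed off


def split_by_clearance(records):
    """Separate sources that may enter the pipeline from those that must be held back."""
    cleared, blocked = [], []
    for record in records:
        if record.license in APPROVED_LICENSES and record.reviewed_by_legal:
            cleared.append(record)
        else:
            blocked.append(record)
    return cleared, blocked


if __name__ == "__main__":
    candidates = [
        SourceRecord("public_domain_novels.txt", "public-domain", True),
        SourceRecord("downloaded_ebooks.txt", None, False),
        SourceRecord("partner_corpus.txt", "licensed-by-contract", False),
    ]
    cleared, blocked = split_by_clearance(candidates)
    print("cleared:", [r.name for r in cleared])
    print("blocked:", [r.name for r in blocked])
```

Blocking by default means that a source with unknown provenance, or one that has a license but no sign-off yet, stays out of the dataset until someone reviews it.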

Key insights

Training LLMs on copyrighted material for Q&A and fan fiction raises significant intellectual property concerns.

Method

The method involved training LLMs on text files of the books. The Q&A system answered a query by retrieving contextually similar excerpts from the dataset, and the story generator combined retrieved excerpts with an existing narrative to produce new text.
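To make the retrieval step concrete, here is a minimal sketch of ranking excerpts by similarity to a query. It is not the deleted blog's code: it swaps in plain word-count vectors and cosine similarity from the Python standard library where a real system would typically use learned embeddings and a vector index. The excerpts are invented placeholder sentences, and `retrieve`, `EXCERPTS`, and `STOPWORDS` are hypothetical names.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists or embeddings).
STOPWORDS = {"a", "all", "and", "before", "for", "in", "it", "of", "on", "the", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, excerpts: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k excerpts most similar to the query, best match first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine_similarity(query_vec, Counter(tokenize(e))), e) for e in excerpts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented placeholder text, not taken from any book.
EXCERPTS = [
    "The trolley rolled past with sweets and snacks for the passengers.",
    "Rain fell on the castle towers all through the night.",
    "The students bought snacks on the train before it reached the station.",
]

if __name__ == "__main__":
    for score, text in retrieve("snacks on the train", EXCERPTS):
        print(f"{score:.2f}  {text}")
```

In a Q&A system the top-ranked excerpts would be returned or passed to a model as context. The copyright concern follows directly from this design: whatever text sits in the dataset is what gets retrieved and surfaced.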

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, AI Ethicist, Tech Journalist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.