Microsoft deletes blog telling users to train AI on pirated Harry Potter books
Summary
Microsoft recently deleted a blog post that instructed users on how to train large language models (LLMs) using pirated Harry Potter books to create Q&A systems and generate fan fiction. The blog, authored by a Microsoft employee, demonstrated building a Q&A system capable of retrieving specific book excerpts for queries like "Wizarding World snacks." It also explored generating new stories by combining existing narratives with retrieved passages. One example involved creating fan fiction where Harry Potter learns about Microsoft's Native Vector Support in SQL from a new friend on the Hogwarts Express, complete with a Microsoft-branded image. This approach raised concerns among rights holders regarding potential copyright infringement based on the model's outputs.
Key takeaway
AI/ML development teams considering training models on third-party content must prioritize legal review of data sources. Ensure all training data is properly licensed or falls under fair use to avoid copyright infringement risks, especially when generating derivative works such as fan fiction or Q&A responses that closely mirror original texts. Have your legal team vet content acquisition strategies before model deployment.
Key insights
Training LLMs on copyrighted material for Q&A and fan fiction raises significant intellectual property concerns.
Principles
- Content generation must respect intellectual property.
- Model outputs can infringe copyright.
Method
The blog's method involved training LLMs on text files of the books, then building Q&A systems and generating new stories by retrieving contextually similar excerpts from that dataset.
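The retrieval step described above can be illustrated with a minimal sketch. It is not the deleted blog's code: the function names (`embed`, `cosine`, `retrieve`) and the placeholder excerpts are hypothetical, and a simple word-count vector stands in for whatever embedding model and vector store the original used. The sample text was written for this sketch so that no copyrighted material is involved.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, excerpts: list[str], k: int = 2) -> list[str]:
    """Return the k excerpts most similar to the query."""
    q = embed(query)
    ranked = sorted(excerpts, key=lambda e: cosine(q, embed(e)), reverse=True)
    return ranked[:k]


# Placeholder excerpts written for this sketch (assumption: licensed or original text).
EXCERPTS = [
    "The trolley came down the corridor selling sweets and pumpkin pastries.",
    "The train left the station at eleven o'clock sharp.",
    "Rain fell over the castle for most of the evening.",
]

top = retrieve("which sweets does the trolley sell", EXCERPTS, k=1)
print(top[0])
```

In a real pipeline the retrieved excerpts would be passed to an LLM as context for answering a question or writing a story, which is why the licensing status of the underlying text matters for the outputs as well as the inputs.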
In practice
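- Use LLMs to build Q&A systems that retrieve relevant excerpts from a text corpus.
- Generate fan fiction with LLMs by combining retrieved passages with new prompts.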
Topics
- Large Language Models
- Generative AI
- Copyright Infringement
- Q&A Systems
- Vector Databases
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, AI Ethicist, Tech Journalist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.