Encyclopedia Britannica sues OpenAI for training on nearly 100,000 articles without permission
Summary
Encyclopedia Britannica and its subsidiary Merriam-Webster have filed a lawsuit against OpenAI in federal court in Manhattan, alleging that OpenAI used nearly 100,000 online articles, encyclopedia entries, and dictionary definitions without permission to train its AI models. The complaint, first reported by Reuters, claims that ChatGPT can produce near-verbatim copies of Britannica content, diverting users from Britannica's own websites. Britannica also accuses OpenAI of trademark infringement, asserting that ChatGPT's responses create a false impression of endorsement and inaccurately cite Britannica as a source. The lawsuit seeks damages and an injunction, alleging that GPT-4 has "memorized" significant portions of Britannica's copyrighted content and can reproduce them on demand. The case feeds a broader dispute over whether AI models "store" copyrighted works in their parameters, a question on which a court in Munich and the UK High Court have reached differing rulings.
Key takeaway
For CTOs and legal teams evaluating AI model deployment, this lawsuit underscores the critical need to scrutinize training data provenance and potential copyright infringement risks. Your organization should implement robust content filtering and attribution mechanisms to prevent the reproduction of copyrighted material and mitigate legal exposure from "memorized" data. Proactively assess your AI models' outputs for verbatim content to avoid similar litigation and reputational damage.
Key insights
AI models' ability to reproduce copyrighted content from training data is a central legal and technical challenge.
Principles
- AI model weights can embed reproducible content.
- Near-verbatim output is treated by plaintiffs as evidence of unauthorized copying.
In practice
- Audit AI model outputs for verbatim content.
- Review training data licensing agreements.
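One way to approach the output audit above is to measure how many word n-grams in a model response also appear in a protected reference text. The following is a minimal sketch under stated assumptions, not a vetted compliance tool: the function names, the 8-word window, and the 0.5 threshold are all illustrative choices that do not come from the lawsuit or the original article.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # range() is empty when the text has fewer than n words
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also occur in the reference."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    shared = out_grams & word_ngrams(reference, n)
    return len(shared) / len(out_grams)


def flag_output(output: str, references: dict, n: int = 8,
                threshold: float = 0.5) -> list:
    """Return ids of reference documents whose overlap meets the threshold.

    `references` maps a hypothetical document id to its full text.
    """
    return [doc_id for doc_id, text in references.items()
            if verbatim_overlap(output, text, n) >= threshold]
```

A score near 1.0 means most of the response repeats runs of words from the reference, while a score near 0.0 means little or no shared phrasing. Exact n-gram matching misses paraphrase and light rewording, so a check like this is a first-pass filter rather than a legal test.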
Topics
- Copyright Infringement
- AI Model Training
- Large Language Models
- Legal Disputes
- Data Memorization
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Legal Professional, AI Ethicist, Tech Journalist
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.