The Atlantic created a searchable database of the music used to train AI
Summary
Alex Reisner, a reporter at The Atlantic, has compiled four significant music datasets used to train AI models and made them publicly searchable. Two are massive, at 12 million and 9 million tracks; the other two each contain more than 100,000 songs. The datasets have been downloaded thousands of times, and Google and Stability have confirmed using them in research. Some sources, such as the Free Music Archive, permit personal streaming, but commercial use requires a license. Reisner notes that the datasets are often distributed as lists of YouTube or Spotify links, so the audio itself has to be downloaded with automated tools. Those tools frequently violate platform terms of service and bypass logins, advertisements, and creator monetization mechanisms. The searchable database shows that the datasets include music by artists as varied as Lady Gaga, Radiohead, Wu-Tang Clan, and Bruce Springsteen, offering transparency into AI training data.
Key takeaway
Legal professionals assessing copyright infringement risk in AI-generated music should scrutinize the provenance of training datasets. As The Atlantic's database reveals, training audio is widely acquired with automated tools that circumvent platform terms of service, which creates significant legal exposure. Before a model is used commercially, verify that it was not trained on data obtained this way, to reduce the risk of lawsuits from artists and content platforms.
Key insights
AI music models are trained on vast, publicly available datasets, often acquired through methods that violate platform terms of service.
Principles
- AI training data sourcing often violates platform terms.
- Public availability does not grant commercial use rights.
- Transparency reveals widespread artist inclusion.
Method
AI developers acquire training audio by automating downloads from lists of YouTube or Spotify links, often using tools that violate platform terms of service and bypass logins, ads, and creator monetization.
In practice
- Search The Atlantic's AI Watchdog for artists.
- Verify licensing for commercially used datasets.
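As a starting point for the licensing check above, a dataset's link list can be audited for provenance before any audio is touched. The following is a minimal sketch, not an established tool: the manifest schema (a `url` key and an optional `license` key), the `PLATFORM_HOSTS` table, and the function names are all assumptions made for illustration. It only counts links per platform and flags entries with no recorded license; a recorded license still has to be read to confirm that it covers commercial use.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical hostname-to-platform table; extend it for other sources.
PLATFORM_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "spotify.com": "spotify",
}


def platform_of(url: str) -> str:
    """Return a platform label for a URL, or 'other' if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    for known, label in PLATFORM_HOSTS.items():
        # Match the bare domain and any subdomain (www., open., music., ...).
        if host == known or host.endswith("." + known):
            return label
    return "other"


def audit_manifest(entries):
    """Count entries per platform and list URLs with no recorded license.

    Each entry is a dict with a 'url' key and an optional 'license' key
    (an assumed schema; real datasets will differ).
    """
    by_platform = Counter(platform_of(e["url"]) for e in entries)
    needs_review = [e["url"] for e in entries if not e.get("license")]
    return by_platform, needs_review


if __name__ == "__main__":
    # Placeholder URLs, not real tracks.
    manifest = [
        {"url": "https://www.youtube.com/watch?v=EXAMPLE1"},
        {"url": "https://open.spotify.com/track/EXAMPLE2", "license": ""},
        {"url": "https://freemusicarchive.org/music/example", "license": "CC BY 4.0"},
    ]
    counts, review = audit_manifest(manifest)
    print(dict(counts))
    print(review)
```

Entries that land in the review list, or that point at a streaming platform rather than a licensed source, are the ones to raise with counsel before any commercial use.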
Topics
- AI Music Training
- Copyright Infringement
- Data Scraping
- Content Licensing
- AI Watchdog
- Music Datasets
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Tech Journalist, AI Ethicist, Legal Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by The Verge.