The Atlantic created a searchable database of the music used to train AI

· Source: The Verge · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI in Music, AI Data Governance & Ethics · Depth: Novice, quick

Summary

Alex Reisner, a reporter at The Atlantic, has compiled four significant music datasets used to train AI models and made them publicly searchable. Two are massive collections of 12 million and 9 million tracks; the other two are smaller sets of more than 100,000 songs each. The datasets have been downloaded thousands of times, and Google and Stability have confirmed using them in research. Some sources, such as the Free Music Archive, permit personal streaming but require a license for commercial use. Reisner notes that these datasets are often distributed as lists of YouTube or Spotify links, so anyone using them must download the audio with automated tools. Those tools frequently violate platform terms of service and bypass logins, advertisements, and creator monetization mechanisms. The searchable database shows that the datasets include music from artists as varied as Lady Gaga, Radiohead, Wu-Tang Clan, and Bruce Springsteen, offering transparency into AI training data.

Key takeaway

If you are a legal professional assessing copyright infringement risk in AI-generated music, scrutinize the provenance of the training datasets. The Atlantic's reporting shows that training audio is widely acquired with automated tools that bypass platform terms of service, which points to significant legal exposure. Verify that your AI models are not trained on data obtained this way, especially when commercial use is intended, to reduce the risk of lawsuits from artists and content platforms.

Key insights

AI music models are trained on vast, publicly available datasets whose audio is often acquired through methods that violate platform terms of service.

Method

AI developers acquire training audio by automating downloads from YouTube or Spotify links, often using tools that bypass platform terms of service, logins, ads, and creator monetization.
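
Because these datasets are often distributed as lists of links, a first provenance check can be as simple as tallying which platforms the links point to. The Python sketch below is illustrative only and is not taken from The Atlantic's tool: the function names, the host-to-platform mapping, and the sample URLs are all hypothetical, and the code inspects link text without downloading anything.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping of host suffixes to platform labels (an assumption;
# extend it to match the sources that appear in your own dataset manifest).
PLATFORM_HOSTS = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "spotify.com": "Spotify",
}


def platform_for(url: str) -> str:
    """Return a platform label for a URL, or "other" if the host is unrecognized."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, label in PLATFORM_HOSTS.items():
        # Match the bare domain and any subdomain such as "www." or "open.".
        if host == suffix or host.endswith("." + suffix):
            return label
    return "other"


def summarize_sources(urls):
    """Count how many links in a manifest point to each platform."""
    return Counter(platform_for(u) for u in urls)


# Placeholder links for illustration; these are not real dataset entries.
sample_manifest = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://youtu.be/EXAMPLE_ID_2",
    "https://open.spotify.com/track/EXAMPLE_ID_3",
    "https://example.org/audio/track.mp3",
]

summary = summarize_sources(sample_manifest)
for platform, count in summary.items():
    print(f"{platform}: {count}")
```

A tally like this only shows where a dataset's links point. Whether the audio behind them may be used, particularly for commercial purposes, still depends on each platform's terms and on licensing.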

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Tech Journalist, AI Ethicist, Legal Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by The Verge.