‘Manners for machines’: how new rules could stop AI scrapers destroying the internet
Summary
Australians are highly anxious about artificial intelligence, driven by concerns over misinformation, job displacement, and the uncompensated use of creative works to train AI models. AI companies routinely scrape content from online sources including pirated books, social media, university repositories, and news outlets, a practice long tolerated under the "open web" ethos. That detente is now faltering as news organizations block scrapers and creators limit what they share. Existing copyright exceptions, such as fair dealing, are inadequate for generative AI. In response, Creative Commons proposes "CC Signals," a voluntary framework that lets creators attach machine-readable instructions to content, specifying permitted machine uses and conditions, based on the principles of consent, compensation, and credit. The framework aims to give creators more control while keeping high-quality data available for AI. It may particularly benefit smaller creators, although enforcing compensation remains a challenge.
Key takeaway
CTOs and VPs of Engineering evaluating data acquisition strategies should consider integrating the proposed CC Signals framework into their teams' scraping and data ingestion pipelines. Like robots.txt, this voluntary system offers a standardized way to respect creator preferences for AI use. Adopting it could mitigate future legal risk and help secure higher-quality, ethically sourced data, which is crucial for reducing AI bias and improving model utility.
Key insights
The CC Signals framework offers creators a voluntary, machine-readable way to manage how AI systems access and use their online content.
Principles
- Consent, compensation, and credit are foundational.
- Respect and recognition for creators are paramount.
Method
The CC Signals framework allows a "declaring party" to attach machine-readable instructions to content, specifying permitted machine uses and conditions, similar to how robots.txt functions.
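To make the robots.txt comparison concrete, here is a minimal sketch of how an ingestion pipeline might honor both kinds of instruction. The robots.txt check uses Python's standard `urllib.robotparser`. The `Declaration` class, the use and condition labels, the `ExampleAIBot` user agent, and the `may_ingest` helper are all hypothetical names invented for illustration; they are not the actual CC Signals syntax or vocabulary, which this summary does not describe.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. "ExampleAIBot" is a hypothetical crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /articles/

User-agent: *
Allow: /
"""


@dataclass(frozen=True)
class Declaration:
    """Hypothetical stand-in for a creator's machine-readable declaration.

    This is NOT the CC Signals format; the field names and labels are assumptions.
    """
    permitted_uses: frozenset[str]             # e.g. {"search-index", "ai-training"}
    conditions: frozenset[str] = frozenset()   # e.g. {"credit", "compensation"}


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_ingest(
    parser: RobotFileParser,
    user_agent: str,
    url: str,
    declaration: Declaration,
    intended_use: str,
    accepted_conditions: frozenset[str],
) -> bool:
    """Return True only if every layer of stated preference allows the use."""
    # Layer 1: the existing voluntary convention, robots.txt.
    if not parser.can_fetch(user_agent, url):
        return False
    # Layer 2: the declared use must be one the creator permits.
    if intended_use not in declaration.permitted_uses:
        return False
    # Layer 3: the pipeline must accept every condition the creator attached.
    return declaration.conditions <= accepted_conditions


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    declaration = Declaration(
        permitted_uses=frozenset({"ai-training"}),
        conditions=frozenset({"credit"}),
    )
    print(may_ingest(parser, "OtherBot", "https://example.com/articles/story",
                     declaration, "ai-training", frozenset({"credit"})))
```

As with robots.txt, nothing in this sketch is enforced by the content itself: the check works only if the party doing the scraping chooses to run it, which is why the framework is described as voluntary.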
In practice
- Implement CC Signals to control AI access to content.
- Utilize CC licenses to specify content reuse conditions.
Topics
- AI Ethics
- Web Scraping
- Copyright Law
- Content Licensing
- Creative Commons
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Ethicist, Policy Maker, Legal Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial intelligence (AI) – The Conversation.