Porn, dog poo and social media snaps: the ‘taskers’ scraping the internet for AI firm part-owned by Meta
Summary
Scale AI, a company 49%-owned by Meta, uses its Outlier platform to recruit tens of thousands of gig workers, including experts in medicine, physics, and economics, to refine AI systems. These "taskers" report being paid to scrape personal data from Instagram and Facebook accounts, harvest copyrighted artwork, and transcribe pornographic soundtracks, work they describe as morally uncomfortable and far removed from high-level AI refinement. Workers expressed concern about collecting data from users, including minors, without those users' clear understanding, and about contributing to their own job displacement. The Guardian's investigation, based on interviews with 10 Outlier contractors, found instances of worker monitoring via Hubstaff and allegations of "bait-and-switch" pay tactics. Scale AI, which contracts with the Pentagon and major tech companies, stated that Outlier offers flexible work with transparent pay and that inappropriate content is addressed, though it confirmed using children's public social media data.
Key takeaway
CTOs and VPs of Engineering evaluating AI model development should scrutinize the data sourcing and labeling practices of third-party vendors like Scale AI. Understand the ethical implications and potential legal risks of data scraped from social media, copyrighted works, and sensitive content. Prioritize vendors with transparent, auditable data governance policies to limit reputational damage and stay compliant with evolving data privacy regulations, especially those covering user consent and minors' data.
Key insights
AI training relies on a vast gig workforce performing ethically questionable data collection from public and private sources.
Principles
- Data collection for AI training often blurs ethical boundaries.
- Gig workers face precarious conditions in the AI economy.
Method
AI models are refined by human "taskers" who label, transcribe, and scrape diverse data, including social media profiles and copyrighted content, in some cases while monitored through tracking software such as Hubstaff.
In practice
- Public social media data is actively used for AI training.
- AI training involves tasks like transcribing sensitive audio.
Topics
- Scale AI
- Outlier Platform
- AI Data Labeling
- Gig Work Ethics
- Social Media Scraping
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Ethicist, Policy Maker, General Interest
Editorial summary, takeaway, and curation by AIssential. Original article published by The Guardian in its AI (artificial intelligence) section.