Language Models Trained on State Media Sources Launder Propaganda

· Source: Tech Policy Press · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation, Cybersecurity & Data Privacy · Depth: Intermediate, short

Summary

A new article in Nature, co-authored by researchers from the University of Oregon, Purdue University, UC San Diego, NYU, and Princeton, shows how state media control influences Large Language Model (LLM) outputs, particularly concerning China. The study found that content scripted by China's Publicity Department frequently appears in common open-source multilingual training datasets, and that widely used LLMs have memorized distinctive Chinese state-curated media content. For instance, OpenAI's GPT-3.5 generated responses "substantially more favourable towards China" when prompted in Chinese than when prompted in English. The pattern extends beyond China: across 37 countries that are each home to 70% of the speakers of a particular language, LLMs queried in a country's official language gave more pro-regime responses than when queried in English where state media control was greater, a gap that correlates with lower World Press Freedom Index scores. In effect, training on such data launders government-manipulated content into ostensibly objective text.

Key takeaway

If you build or deploy LLMs as an AI developer or research scientist, scrutinize your training data sources carefully, especially for multilingual models, to reduce the risk of inadvertently amplifying state-sponsored propaganda. Your models may be subtly laundering strategic rhetoric into objective-seeming information, which could give political actors an incentive to manipulate online content further. Implement data provenance checks and bias detection to keep outputs neutral and prevent unintended influence.
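
As a rough illustration of what a first-pass provenance check might look like, the sketch below tags each training document by its source domain and reports the share that comes from a flagged list. It is not taken from the study: the domain names, the `url` field, and the function names are hypothetical placeholders, and a real list of state-controlled outlets would need to come from a reviewed source.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical placeholder domains (assumption); not real outlets.
FLAGGED_DOMAINS = {"state-news.example", "official-daily.example"}


def source_domain(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_flagged(url: str) -> bool:
    """True if the host is a flagged domain or a subdomain of one."""
    host = source_domain(url)
    return any(host == d or host.endswith("." + d) for d in FLAGGED_DOMAINS)


def provenance_report(docs):
    """Count flagged vs. other documents.

    Each doc is assumed to be a dict with a 'url' key (hypothetical schema).
    """
    counts = Counter("flagged" if is_flagged(d["url"]) else "other" for d in docs)
    total = sum(counts.values())
    share = counts["flagged"] / total if total else 0.0
    return {
        "flagged": counts["flagged"],
        "other": counts["other"],
        "flagged_share": share,
    }


docs = [
    {"url": "https://www.state-news.example/a"},
    {"url": "https://blog.example.org/b"},
]
report = provenance_report(docs)
```

Domain matching alone would miss state-scripted text that has been republished on other sites, so a check like this is a starting point at best.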

Key insights

LLMs trained on state-controlled media data can inadvertently launder propaganda into seemingly objective outputs.

Method

Researchers conducted six studies, including analyzing open-source datasets for state-scripted content, examining GPT-3.5 responses to China-related prompts in different languages, and correlating LLM output favorability with World Press Freedom Index scores across 37 nations.
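
The cross-country comparison can be pictured as a two-step calculation: a favorability gap per country (official language minus English), then a correlation of those gaps with press freedom scores. The sketch below is only an illustration under assumptions: this summary does not describe the study's scoring or statistical method, the function names are hypothetical, and the numbers are made-up placeholders that exist solely to exercise the code.

```python
from math import sqrt


def favorability_gap(native_scores, english_scores):
    """Mean favorability in the official language minus mean favorability in English."""
    native_mean = sum(native_scores) / len(native_scores)
    english_mean = sum(english_scores) / len(english_scores)
    return native_mean - english_mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up placeholder values for three unnamed countries (NOT study data).
press_freedom = [80.0, 55.0, 30.0]
gaps = [
    favorability_gap([0.1, 0.2], [0.1, 0.1]),
    favorability_gap([0.4, 0.5], [0.2, 0.2]),
    favorability_gap([0.8, 0.9], [0.3, 0.2]),
]
r = pearson(press_freedom, gaps)
print(f"correlation between press freedom and favorability gap: {r:.2f}")
```

With real data, a negative coefficient would match the direction the summary reports: lower press freedom scores going with more pro-regime responses in the official language.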

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Ethicist, Policy Maker

Editorial summary, takeaway, and curation by AIssential. Original article published by Tech Policy Press.