Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics
Summary
The paper introduces a "bilayer coupled SIR/SIRS" framework for model collapse, treating data corpora and AI models as two interacting populations susceptible to synthetic data contamination. This phenomenological mean-field model, whose SIRS variant adds waning immunity, yields a basic reproduction number R₀=√[β₄β₁₄/((γ₄+μ₄)(γ₁₄+μ₁₄))]. An illustrative calibration using public AI text prevalence data (projected to reach up to 74% by 2025) shows supercritical dynamics (R₀>1) in all three scenarios considered. Sobol sensitivity analysis identifies synthetic-text detection (γ₄) as the highest-impact parameter. GPT-2 experiments (192 single-chain runs, 1,088 source-diversity runs) confirm dose-response degradation and diversity loss (Distinct-2 drops from 0.68 to 0.38). Multi-source mixing modestly attenuates collapse at 100% contamination (α=1), by about 2 perplexity (PPL) points, but the effect vanishes at 50% contamination (α=0.5).
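The Distinct-2 figures quoted above measure lexical diversity as the share of bigrams that are unique. A minimal sketch of that metric follows, assuming plain whitespace tokenization and corpus-level pooling; the paper's actual tokenizer and aggregation are not stated in this summary, and the function name `distinct_n` and the sample sentences are purely illustrative.

```python
def distinct_n(texts, n=2):
    """Return unique n-grams / total n-grams, pooled over all texts.

    Assumption: tokens are whitespace-separated words. A real evaluation
    would more likely use the model's own tokenizer.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical samples: varied text versus text that repeats itself.
diverse = ["the cat sat on the mat", "a dog ran in the park"]
repetitive = ["the cat sat the cat sat", "the cat sat the cat sat"]

print("diverse:   ", distinct_n(diverse))
print("repetitive:", distinct_n(repetitive))
```

A falling score across training generations, as in the reported drop from 0.68 to 0.38, means the model's output reuses the same word pairs more and more often.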
Key takeaway
For AI architects evaluating data pipeline resilience, this research indicates that reducing the fraction of synthetic data in training corpora is paramount. Focus on robust detection and filtering mechanisms (increasing γ₄) rather than relying solely on diversifying data sources. Multi-source mixing offers a modest buffer at full contamination, but its effect disappears at realistic partial contamination levels (α=0.5), which points to direct contamination control as the most effective strategy against model collapse.
Key insights
AI model collapse from synthetic data cross-contamination can be modeled as a bilayer epidemic system with a quantifiable reproduction number R₀.
Principles
- Ecosystem contamination is a network, not a chain.
- R₀ is a geometric mean across data and model layers.
- Detection is the highest-impact intervention parameter.
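The second principle can be made concrete with a small calculation. The sketch below assumes the formula from the summary factors into one term β/(γ+μ) per layer, so that R₀ is the geometric mean of the two terms. The parameter names (`beta_1`, `gamma_1`, and so on) and all numeric values are hypothetical placeholders, not the paper's calibrated values.

```python
import math


def layer_factor(beta, gamma, mu):
    """Per-layer term beta / (gamma + mu): transmission over total removal."""
    return beta / (gamma + mu)


def basic_reproduction_number(beta_1, gamma_1, mu_1, beta_2, gamma_2, mu_2):
    """Geometric mean of the two per-layer terms (assumed factorization)."""
    return math.sqrt(
        layer_factor(beta_1, gamma_1, mu_1) * layer_factor(beta_2, gamma_2, mu_2)
    )


# Hypothetical baseline: layer 1 spreads contamination faster than layer 2.
baseline = basic_reproduction_number(0.8, 0.1, 0.1, 0.25, 0.15, 0.1)

# Same system with a higher removal (detection) rate in layer 1.
with_detection = basic_reproduction_number(0.8, 0.3, 0.1, 0.25, 0.15, 0.1)

print(f"baseline R0:       {baseline:.3f}")
print(f"with detection R0: {with_detection:.3f}")
```

Because of the square root, halving one layer's term lowers R₀ by a factor of √2 rather than 2, so pushing R₀ below 1 through a single layer takes a larger change than the same intervention would need in a one-layer model.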
Method
The bilayer coupled SIR/SIRS framework uses ODEs to model data corpora and AI models as interacting populations with susceptible, infected, and recovered compartments, linked by cross-layer transmission.
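A minimal simulation sketch of such a system follows. The exact equations are not reproduced in this summary, so the structure here is an assumption: each layer is a normalized SIRS population with turnover rate μ, removal rate γ, and waning rate ω, and new contamination in one layer is driven by the contaminated fraction of the other layer. All parameter names and values are hypothetical, and a hand-written fourth-order Runge-Kutta step keeps the example dependency-free.

```python
def bilayer_sirs_rhs(state, p):
    """Right-hand side for [S_d, I_d, R_d, S_m, I_m, R_m] (assumed form)."""
    s_d, i_d, r_d, s_m, i_m, r_m = state
    new_d = p["beta_d"] * s_d * i_m  # data contaminated by infected models
    new_m = p["beta_m"] * s_m * i_d  # models contaminated by infected data
    return [
        p["mu_d"] * (1.0 - s_d) - new_d + p["omega_d"] * r_d,
        new_d - (p["gamma_d"] + p["mu_d"]) * i_d,
        p["gamma_d"] * i_d - (p["mu_d"] + p["omega_d"]) * r_d,
        p["mu_m"] * (1.0 - s_m) - new_m + p["omega_m"] * r_m,
        new_m - (p["gamma_m"] + p["mu_m"]) * i_m,
        p["gamma_m"] * i_m - (p["mu_m"] + p["omega_m"]) * r_m,
    ]


def rk4_step(f, state, p, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, p)
    k2 = f([x + 0.5 * dt * k for x, k in zip(state, k1)], p)
    k3 = f([x + 0.5 * dt * k for x, k in zip(state, k2)], p)
    k4 = f([x + dt * k for x, k in zip(state, k3)], p)
    return [
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


def simulate(p, state, t_end=200.0, dt=0.05):
    """Integrate from t=0 to t_end and return the final state."""
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(bilayer_sirs_rhs, state, p, dt)
    return state


# Hypothetical parameters giving R0 > 1 under the assumed structure.
supercritical = {
    "beta_d": 0.8, "gamma_d": 0.1, "mu_d": 0.1, "omega_d": 0.05,
    "beta_m": 0.5, "gamma_m": 0.15, "mu_m": 0.1, "omega_m": 0.05,
}
# Same system with much stronger data-layer removal, giving R0 < 1.
subcritical = dict(supercritical, gamma_d=1.9)

start = [0.99, 0.01, 0.0, 0.99, 0.01, 0.0]
final_super = simulate(supercritical, start)
final_sub = simulate(subcritical, start)

print(f"contaminated data fraction, R0 > 1: {final_super[1]:.3f}")
print(f"contaminated data fraction, R0 < 1: {final_sub[1]:.6f}")
```

In the first run contamination settles at a persistent endemic level in both layers; in the second it dies out, which is the threshold behavior that R₀ summarizes.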
In practice
- Prioritize detection and filtering to reduce contamination fraction.
- Understand R₀ to assess ecosystem contamination risk.
- Consider herd immunity strategies for data hygiene.
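To give the herd immunity item a rough quantitative reading, the sketch below adapts the textbook threshold 1 − 1/R₀ to a two-layer setting. This is an illustration under stated assumptions, not a result taken from the paper: it assumes new contamination in each layer scales linearly with that layer's unprotected fraction, so protecting a fraction p of both layers scales R₀ by (1 − p), while protecting only one layer scales it by √(1 − p). The function name and the example value R₀ = 2 are hypothetical.

```python
def critical_protected_fraction(r0, layers_protected=2):
    """Smallest protected fraction p that brings the effective R0 to 1.

    Assumed scaling: both layers protected -> R_eff = (1 - p) * R0,
    one layer protected -> R_eff = sqrt(1 - p) * R0.
    """
    if layers_protected not in (1, 2):
        raise ValueError("layers_protected must be 1 or 2")
    if r0 <= 1.0:
        return 0.0  # already subcritical, no protection needed
    return 1.0 - r0 ** (-2.0 / layers_protected)


r0_example = 2.0  # hypothetical value, not a calibrated estimate
both_layers = critical_protected_fraction(r0_example, layers_protected=2)
one_layer = critical_protected_fraction(r0_example, layers_protected=1)

print(f"protect both layers: p >= {both_layers:.2f}")
print(f"protect one layer:   p >= {one_layer:.2f}")
```

Under these assumptions, hygiene applied to only one layer must cover a larger fraction than hygiene applied to both, because the square root in R₀ dampens the effect of single-layer interventions.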
Topics
- Model Collapse
- Synthetic Data Contamination
- Epidemic Modeling
- SIR/SIRS Framework
- GPT-2
- Data Hygiene
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.