Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces a "bilayer coupled SIR/SIRS" framework for model collapse that treats data corpora and AI models as two interacting populations, each susceptible to synthetic data contamination. This phenomenological mean-field model, whose SIRS variant adds waning immunity, yields a basic reproduction number R₀=√[β₄β₁₄/((γ₄+μ₄)(γ₁₄+μ₁₄))]. An illustrative calibration using public AI-text prevalence data (projected up to 74% by 2025) shows supercritical dynamics (R₀>1) in all three scenarios considered. Sobol sensitivity analysis identifies synthetic-text detection (γ₄) as the highest-impact parameter. GPT-2 experiments (192 single-chain runs, 1,088 source-diversity runs) confirm dose-response degradation and diversity loss (Distinct-2 drops from 0.68 to 0.38). Multi-source mixing modestly attenuates collapse at 100% contamination (α=1), by about 2 perplexity points (PPL), but this effect vanishes at 50% contamination (α=0.5).
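
The Distinct-2 figures above are a diversity score: the number of unique bigrams divided by the total number of bigrams in a set of generated texts. The following is a minimal sketch of that standard definition, not the paper's evaluation code. The function name `distinct_n` and the whitespace tokenization are assumptions made for illustration; the paper's tokenizer is not specified in this summary.

```python
def distinct_n(texts, n=2):
    """Return unique n-grams / total n-grams over a list of strings.

    Assumption: tokens are obtained by whitespace splitting, which may
    differ from the tokenizer used in the original experiments.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Slide a window of length n over the token sequence.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # Guard against empty input or texts shorter than n tokens.
    return len(unique) / total if total else 0.0


# Hypothetical samples: repetitive text scores lower than varied text.
repetitive = ["the cat sat the cat sat the cat sat"]
varied = ["the cat sat on a warm mat near the door"]
print(distinct_n(repetitive), distinct_n(varied))
```

A falling score across generations means the generated text reuses the same word pairs more often, which is the diversity loss the experiments report.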

Key takeaway

For AI architects evaluating data pipeline resilience, this research indicates that reducing the synthetic-data contamination fraction matters most. Prioritize robust detection and filtering mechanisms (increasing γ₄) over source diversification alone. Multi-source mixing offers a modest buffer at full contamination, but its effect disappears at more realistic partial contamination levels (α=0.5), which suggests that direct contamination control is the more effective strategy for preventing model collapse.

Key insights

AI model collapse from synthetic data cross-contamination can be modeled as a bilayer epidemic system with a quantifiable reproduction number R₀.
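
The reproduction number quoted in the summary can be evaluated directly from its six rate parameters. The sketch below only transcribes that expression. The argument names are assumptions: the `_a` arguments stand for the parameters printed with subscript ₄ and the `_b` arguments for those printed with subscript ₁₄, and all numeric values are hypothetical rather than the paper's calibrated scenarios.

```python
import math


def basic_reproduction_number(beta_a, beta_b, gamma_a, gamma_b, mu_a, mu_b):
    """R0 = sqrt(beta_a * beta_b / ((gamma_a + mu_a) * (gamma_b + mu_b))).

    beta: cross-layer transmission rates; gamma: removal (detection or
    recovery) rates; mu: turnover rates. Names are illustrative only.
    """
    return math.sqrt((beta_a * beta_b) / ((gamma_a + mu_a) * (gamma_b + mu_b)))


# Hypothetical rates, chosen only to show the direction of the effect.
baseline = basic_reproduction_number(0.5, 0.5, 0.1, 0.1, 0.05, 0.05)
better_detection = basic_reproduction_number(0.5, 0.5, 0.6, 0.1, 0.05, 0.05)
print(baseline, better_detection)
```

Because the detection rate sits in the denominator, raising it lowers R₀, and contamination stops spreading once R₀ drops below 1. That is the mechanism behind the recommendation to invest in detection and filtering.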

Method

The bilayer coupled SIR/SIRS framework uses ODEs to model data corpora and AI models as interacting populations with susceptible, infected, and recovered compartments, linked by cross-layer transmission.
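
One way to picture such a system is a pair of SIR blocks in which each layer is infected only by the other: contaminated models pollute the data layer, and contaminated data pollutes the model layer. The sketch below is an assumed formulation in that spirit, not the paper's actual equations, which this summary does not give. The parameter names (`beta_d`, `omega_m`, and so on), the normalization of each layer to 1, the waning term `omega` for the SIRS variant, and every numeric value are hypothetical.

```python
def bilayer_sir_rhs(state, p):
    """Right-hand side of an assumed bilayer SIR/SIRS system.

    state = [S_d, I_d, R_d, S_m, I_m, R_m], fractions of each layer.
    Setting omega_* = 0 gives plain SIR; omega_* > 0 adds waning (SIRS).
    """
    s_d, i_d, r_d, s_m, i_m, r_m = state
    new_d = p["beta_d"] * s_d * i_m  # data contaminated by infected models
    new_m = p["beta_m"] * s_m * i_d  # models contaminated by infected data
    return [
        p["mu_d"] * (1 - s_d) - new_d + p["omega_d"] * r_d,
        new_d - (p["gamma_d"] + p["mu_d"]) * i_d,
        p["gamma_d"] * i_d - (p["mu_d"] + p["omega_d"]) * r_d,
        p["mu_m"] * (1 - s_m) - new_m + p["omega_m"] * r_m,
        new_m - (p["gamma_m"] + p["mu_m"]) * i_m,
        p["gamma_m"] * i_m - (p["mu_m"] + p["omega_m"]) * r_m,
    ]


def rk4_step(f, state, p, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state, p)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)], p)
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)], p)
    k4 = f([s + dt * k for s, k in zip(state, k3)], p)
    return [
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


def simulate(state, p, dt=0.1, steps=2000):
    """Integrate from the initial state and return the full trajectory."""
    trajectory = [list(state)]
    for _ in range(steps):
        state = rk4_step(bilayer_sir_rhs, state, p, dt)
        trajectory.append(state)
    return trajectory


# Hypothetical rates; both layers start with 1% contaminated.
params = dict(beta_d=0.5, beta_m=0.5, gamma_d=0.1, gamma_m=0.1,
              mu_d=0.05, mu_m=0.05, omega_d=0.0, omega_m=0.0)
trajectory = simulate([0.99, 0.01, 0.0, 0.99, 0.01, 0.0], params)
print("peak contaminated data fraction:", max(s[1] for s in trajectory))
```

Under these assumed equations, linearizing around the contamination-free state gives a threshold of the same square-root form as the R₀ expression in the summary, which is why this cross-infection structure was chosen for the illustration.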

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.