7 myths about “more data” (and why models get worse)
Summary
Adding more data to a machine learning model can paradoxically degrade it: accuracy drops, calibration worsens, and drift in production accelerates. The cause usually lies in what kind of data was added, how it was collected, and what it teaches the model. The common advice to "just add more data" is a pervasive myth because it overlooks critical failure modes such as noise, distribution shift, and data leakage. Data is better viewed as a "teacher" than as "fuel": it can carry poisoned labels, distort distributions, or convey incorrect lessons, leaving the model worse rather than better.
Key takeaway
Data scientists and MLOps engineers who see a model degrade after expanding its dataset should critically evaluate the quality and collection methods of the new data. Do not assume more data inherently improves a model; actively look for signs of label poisoning, distribution shift, and data leakage. Prioritize data hygiene and relevance over sheer volume so the model does not learn incorrect patterns and stays robust in production.
Key insights
More data can degrade model performance by introducing noise, distribution shifts, or label poisoning.
Principles
- Data acts as a teacher, not just fuel.
- Quality and relevance outweigh quantity.
- Monitor for noise, shift, and leakage.
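One way to monitor for shift when a new batch arrives is to compare a feature's distribution in the new data against the existing training data. The following is a minimal sketch, not the original article's method: it computes a Population Stability Index (PSI) with the standard library only. The function name `psi`, the choice of 10 equal-width bins, and the `eps` floor are all assumptions made for illustration.

```python
import math

def psi(baseline, new, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty numeric samples.

    Bin edges are taken from the baseline sample (an assumption of this
    sketch); new values outside that range fall into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = fractions(baseline), fractions(new)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical data: the new batch is the baseline shifted by 0.5.
baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

A PSI of zero means the binned distributions match. A commonly quoted rule of thumb treats values above roughly 0.25 as a substantial shift, but the right threshold depends on the feature and the application.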
In practice
- Audit labels for poisoning.
- Check for distribution shifts.
- Identify data leakage sources.
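Two of the checks above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the article's procedure: each row is assumed to be a `(features, label)` pair with hashable feature values, and the helper names `conflicting_labels` and `train_test_overlap` are hypothetical. Identical features with different labels are only one symptom of noisy or poisoned labels, and exact duplicates across splits are only one form of leakage.

```python
from collections import defaultdict

def conflicting_labels(rows):
    """Return feature tuples that appear with more than one label."""
    seen = defaultdict(set)
    for features, label in rows:
        seen[tuple(features)].add(label)
    return {f: labels for f, labels in seen.items() if len(labels) > 1}

def train_test_overlap(train_rows, test_rows):
    """Return feature tuples present in both splits (a possible leak)."""
    train_keys = {tuple(f) for f, _ in train_rows}
    test_keys = {tuple(f) for f, _ in test_rows}
    return sorted(train_keys & test_keys)

# Hypothetical rows, made up for this example.
train = [((1, 2), "a"), ((1, 2), "b"), ((3, 4), "a")]
test = [((3, 4), "a"), ((5, 6), "b")]

print(conflicting_labels(train))        # rows to send back for relabeling
print(train_test_overlap(train, test))  # rows to drop from one split
```

Running both checks on each newly added batch, before retraining, keeps the audit cheap. Near-duplicates and leakage through derived features need separate, more involved checks.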
Topics
- Data Quality
- Model Degradation
- Data Distribution
- Label Noise
- Machine Learning Myths
Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.