7 myths about “more data” (and why models get worse)

Source: Machine Learning on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, quick

Summary

Adding more data to a machine learning model can make it worse: accuracy drops, calibration degrades, and the model drifts faster in production. The cause is usually what kind of data was added, how it was collected, and what it teaches the model. The common advice to "just add more data" is a myth because it overlooks failure modes such as noise, distribution shift, and data leakage. Data is better understood as a "teacher" than as "fuel": a bad teacher can poison labels, distort distributions, or convey the wrong lessons.

Key takeaway

Data scientists and MLOps engineers who see a model degrade after expanding its dataset should critically evaluate the quality and collection method of the new data. Do not assume that more data inherently improves a model; instead, actively look for signs of label poisoning, distribution shift, and data leakage. Prioritize data hygiene and relevance over sheer volume so that the model does not learn incorrect patterns and stays robust in production.
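As one way to make these checks concrete, the sketch below screens a new batch for two of the problems named above: rows that also appear in the held-out test set (leakage) and identical feature rows that carry different labels (a possible sign of label poisoning). It is a minimal illustration and not the original article's method. The function names, the `(features, label)` row format, and the toy data are all assumptions, and the exact-match comparison only catches verbatim duplicates.

```python
from collections import defaultdict


def find_leaked_rows(new_rows, test_rows):
    """Return new training rows whose features also appear in the test set."""
    test_features = {tuple(features) for features, _ in test_rows}
    return [
        (features, label)
        for features, label in new_rows
        if tuple(features) in test_features
    ]


def find_label_conflicts(rows):
    """Return feature tuples that appear with more than one distinct label."""
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[tuple(features)].add(label)
    return {
        features: labels
        for features, labels in labels_by_features.items()
        if len(labels) > 1
    }


# Hypothetical toy data: each row is a (features, label) pair.
existing_train = [((1.0, 0.2), "spam"), ((0.3, 0.9), "ham")]
new_batch = [((1.0, 0.2), "ham"), ((0.5, 0.5), "ham"), ((0.7, 0.1), "spam")]
held_out_test = [((0.5, 0.5), "ham"), ((0.9, 0.9), "spam")]

leaked = find_leaked_rows(new_batch, held_out_test)
conflicts = find_label_conflicts(existing_train + new_batch)

print("Leaked rows:", leaked)
print("Conflicting labels:", conflicts)
```

In this toy example, one new row duplicates a test row and one feature vector appears with both labels. Near-duplicates, such as the same record with trivial formatting differences, would need a fuzzier comparison than the exact match assumed here.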

Key insights

More data can degrade model performance by introducing noise, distribution shifts, or label poisoning.
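A distribution shift in a single numeric feature can be screened for before new data is merged. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions, using only the standard library. The choice of this statistic, the function name `ks_statistic`, the toy values, and the alert threshold of 0.3 are illustrative assumptions and do not come from the original article.

```python
from bisect import bisect_right


def ks_statistic(reference, candidate):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    ref = sorted(reference)
    cand = sorted(candidate)
    max_gap = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for x in ref + cand:
        ref_cdf = bisect_right(ref, x) / len(ref)
        cand_cdf = bisect_right(cand, x) / len(cand)
        max_gap = max(max_gap, abs(ref_cdf - cand_cdf))
    return max_gap


# Hypothetical feature values from the existing data and from a new batch.
existing_values = [1, 2, 3, 4]
new_values = [3, 4, 5, 6]

SHIFT_THRESHOLD = 0.3  # assumed cut-off; tune it for the feature and sample size

gap = ks_statistic(existing_values, new_values)
if gap > SHIFT_THRESHOLD:
    print(f"Possible distribution shift (KS statistic = {gap:.2f})")
```

A statistic of 0 means the two samples have identical empirical distributions and 1 means they do not overlap at all. A large value is a prompt to inspect how the new batch was collected, not proof that the data is unusable.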

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.