Dozens of AI disease-prediction models were trained on dubious data
Summary
Researchers have identified 124 peer-reviewed papers that used two open-access health datasets from Kaggle, the "Stroke Prediction Dataset" and the "Diabetes prediction data set," to train AI models for predicting stroke and diabetes risk. These datasets, downloaded over 288,000 times combined, exhibit significant irregularities, such as unusually complete data and duplicated values, suggesting they may be fabricated rather than derived from real patients. At least two of these AI models have reportedly been used in clinical settings in Indonesia and Spain, with one also appearing in a 2024 medical-device patent application. Public health experts warn that models trained on such unreliable data are intrinsically untrustworthy and could lead to flawed diagnoses and inappropriate treatment decisions. Two journals are currently investigating studies that utilized these questionable datasets.
Key takeaway
CTOs and VPs of Engineering overseeing AI development in healthcare must prioritize rigorous data provenance and validation. Models trained on unverified or potentially fabricated data, even from widely used public repositories like Kaggle, pose significant clinical risks and ethical liabilities. Implement strict data governance policies that require full disclosure of data sources, and run thorough integrity checks so that your AI systems are built on reliable, real-world information. Doing so reduces the risk of flawed diagnoses and inappropriate patient care.
Key insights
AI models for disease prediction are being trained on potentially fabricated public datasets, risking patient harm.
Principles
- Data provenance is critical for clinical AI reliability.
- Incomplete data is a hallmark of real-world health datasets, so unusually complete data warrants scrutiny.
In practice
- Verify data source and integrity before model training.
- Insist on data disclosure for medical AI research.
- Remove dubious datasets from public platforms.
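
As a rough illustration of the integrity checks mentioned above, the following minimal sketch measures two of the irregularities the researchers describe: unusually complete data and duplicated values. It is not the researchers' method. The function names, the list of missing-value tokens, and both thresholds are assumptions chosen for illustration, and the sample rows are invented placeholders, not taken from either Kaggle dataset.

```python
import csv
import io
from collections import Counter

# Assumption: strings treated as "missing". Adjust to the dataset at hand.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "unknown"}


def integrity_report(rows):
    """Summarize missingness and exact-duplicate rows for a list of dict records."""
    total_rows = len(rows)
    if total_rows == 0:
        raise ValueError("dataset is empty")
    columns = list(rows[0].keys())
    total_cells = total_rows * len(columns)
    # Count cells whose value is absent or matches a missing-value token.
    missing_cells = sum(
        1
        for row in rows
        for col in columns
        if (row.get(col) or "").strip().lower() in MISSING_TOKENS
    )
    # Count rows that repeat an earlier row exactly.
    row_counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicate_rows = sum(count - 1 for count in row_counts.values())
    return {
        "rows": total_rows,
        "missing_rate": missing_cells / total_cells,
        "duplicate_rate": duplicate_rows / total_rows,
    }


def flag_concerns(report, min_missing_rate=0.01, max_duplicate_rate=0.05):
    """Return warnings; both thresholds are hypothetical, not established cutoffs."""
    concerns = []
    if report["missing_rate"] < min_missing_rate:
        concerns.append("suspiciously complete: almost no missing values")
    if report["duplicate_rate"] > max_duplicate_rate:
        concerns.append("high share of exact duplicate rows")
    return concerns


if __name__ == "__main__":
    # Invented placeholder records, used only to exercise the functions.
    sample = "age,bmi,glucose\n50,27.1,110\n50,27.1,110\n61,,140\n45,31.0,95\n"
    rows = list(csv.DictReader(io.StringIO(sample)))
    report = integrity_report(rows)
    print(report)
    print(flag_concerns(report))
```

A check like this only surfaces warning signs. It cannot establish where a dataset came from, so it complements rather than replaces source disclosure and provenance review.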
Topics
- AI Disease Prediction
- Dubious Medical Data
- Clinical AI Models
- Kaggle Datasets
- Data Provenance
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.