Dozens of AI disease-prediction models were trained on dubious data

2026-04-15 · Source: Machine learning : nature.com subject feeds · Field: Health & Wellbeing — Medical Devices & Health Technology, Clinical Care & Medical Practice, Health & Medical Research · Depth: Advanced, quick

Summary

Researchers have identified 124 peer-reviewed papers that used two open-access health datasets from Kaggle, the "Stroke Prediction Dataset" and the "Diabetes prediction data set," to train AI models for predicting stroke and diabetes risk. These datasets, downloaded over 288,000 times combined, exhibit significant irregularities, such as unusually complete data and duplicated values, suggesting they may be fabricated rather than derived from real patients. At least two of these AI models have reportedly been used in clinical settings in Indonesia and Spain, with one also appearing in a 2024 medical-device patent application. Public health experts warn that models trained on such unreliable data are intrinsically untrustworthy and could lead to flawed diagnoses and inappropriate treatment decisions. Two journals are currently investigating studies that utilized these questionable datasets.

Key takeaway

CTOs and VPs of Engineering overseeing AI development in healthcare must prioritize rigorous data provenance and validation. Models trained on unverified or potentially fabricated data, even from widely used public repositories like Kaggle, pose significant clinical risks and ethical liabilities. Implement strict data governance policies that require full disclosure of data sources, and run thorough integrity checks so that AI systems are built on reliable, real-world information rather than data that could lead to flawed diagnoses and inappropriate patient care.

Key insights

AI models for disease prediction are being trained on potentially fabricated public datasets, risking patient harm.

Principles

Data provenance is critical for clinical AI reliability.
Incomplete data is a hallmark of real-world health datasets; unusually complete data is a warning sign.

In practice

Verify data source and integrity before model training.
Insist on data disclosure for medical AI research.
Remove dubious datasets from public platforms.
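
As a rough illustration of the first step above, the sketch below screens a tabular dataset for the two irregularities the summary mentions: unusually complete data and duplicated values. The function name `integrity_report`, the `MISSING` markers, and both thresholds are hypothetical choices made for this example, not values from the original article. A flag here is only a prompt for manual review of the data's provenance; it cannot show that a dataset is fabricated.

```python
from collections import Counter

# Markers treated as "missing" (an assumption; adjust to the dataset at hand).
MISSING = {None, "", "N/A", "NA", "nan", "Unknown"}


def integrity_report(rows, min_missing_rate=0.001, max_duplicate_rate=0.05):
    """Summarize missingness and exact-duplicate rows.

    rows: list of dicts mapping column name -> scalar (hashable) value.
    Both thresholds are illustrative placeholders, not validated cut-offs.
    """
    if not rows:
        raise ValueError("no rows to check")

    # Share of cells holding one of the missing-value markers.
    total_cells = sum(len(r) for r in rows)
    missing_cells = sum(1 for r in rows for v in r.values() if v in MISSING)
    missing_rate = missing_cells / total_cells

    # Rows that repeat an earlier row exactly, column for column.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicate_rows = sum(c - 1 for c in counts.values())
    duplicate_rate = duplicate_rows / len(rows)

    return {
        "missing_rate": missing_rate,
        "duplicate_rate": duplicate_rate,
        # Real patient records usually have gaps; near-zero missingness is odd.
        "suspiciously_complete": missing_rate < min_missing_rate,
        "many_duplicates": duplicate_rate > max_duplicate_rate,
    }


if __name__ == "__main__":
    sample = [
        {"age": 50, "bmi": 28.1},
        {"age": 50, "bmi": 28.1},
        {"age": 61, "bmi": None},
    ]
    print(integrity_report(sample))
```

A check like this belongs before any training run, alongside a written record of where the data came from and who collected it.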

Topics

AI Disease Prediction
Dubious Medical Data
Clinical AI Models
Kaggle Datasets
Data Provenance

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.