MIT scientists investigate memorization risk in the age of clinical AI
Summary
MIT researchers, led by Sana Tonekaboni and Marzyeh Ghassemi, presented research at the 2025 Conference on Neural Information Processing Systems (NeurIPS) on how AI foundation models trained on de-identified electronic health records (EHRs) can inadvertently memorize patient-specific information. Memorization, as opposed to generalization, could lead to privacy violations if attackers prompt a model to extract sensitive data. The team developed a rigorous testing framework for evaluating leakage in a healthcare context. It assesses practical risk by how much information an attacker already possesses and how sensitive the leaked data is. They found that more attacker knowledge increases the likelihood of leakage, and that patients with unique conditions are particularly vulnerable, even when the data is de-identified. The work aims to give the community practical evaluation steps to run before releasing such models.
Key takeaway
If you are a CTO or VP of Engineering overseeing AI model development in healthcare, integrate robust privacy evaluation tests into your model release pipeline. Your teams should distinguish true memorization from generalization and assess the practical risk of data leakage, especially for sensitive patient information and for patients with unique conditions. Prioritize safeguards that account for how much an attacker may already know.
Key insights
AI models trained on EHRs can memorize patient data, necessitating rigorous testing to prevent privacy breaches.
Principles
- Evaluate AI leakage in healthcare context.
- Distinguish generalization from memorization.
- Assess practical risk based on attacker knowledge.
Method
The research team developed a series of tests that measure various types of uncertainty and distinguish generalization from patient-level memorization. To assess practical risk to patients, the tests consider different tiers of attack possibility.
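One way to picture a test of this kind is the minimal sketch below. It is not the researchers' framework: the `model(prompt, target)` interface, the field names, the toy data, and the "gap" metric are all assumptions made for illustration. The idea is to compare how often a model recovers a hidden field for patients in its training set against comparable patients it never saw, at increasing tiers of attacker knowledge. A gap that opens up as the attacker knows more hints at memorization rather than generalization.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
# Hypothetical model interface: given the fields an attacker knows,
# return the model's guess for the hidden target field.
Predictor = Callable[[Record, str], str]


def hit_rate(model: Predictor, records: Sequence[Record],
             known_fields: Sequence[str], target: str) -> float:
    """Fraction of records whose target value the model recovers
    when prompted with only the known fields."""
    if not records:
        return 0.0
    hits = 0
    for record in records:
        prompt = {k: record[k] for k in known_fields}
        if model(prompt, target) == record[target]:
            hits += 1
    return hits / len(records)


def memorization_gap_by_tier(model: Predictor,
                             train_records: Sequence[Record],
                             holdout_records: Sequence[Record],
                             tiers: Sequence[Sequence[str]],
                             target: str) -> List[float]:
    """For each attacker-knowledge tier, return train hit rate minus
    holdout hit rate. The holdout rate approximates what generalization
    alone would recover."""
    return [
        hit_rate(model, train_records, known, target)
        - hit_rate(model, holdout_records, known, target)
        for known in tiers
    ]


# Toy data and a toy "model" that memorizes its training set (all hypothetical).
TRAIN = [
    {"age": "70", "zip": "021", "dx": "rare_x"},
    {"age": "70", "zip": "022", "dx": "flu"},
]
HOLDOUT = [
    {"age": "70", "zip": "023", "dx": "rare_y"},
    {"age": "71", "zip": "021", "dx": "flu"},
]


def toy_model(prompt: Record, target: str) -> str:
    # If the prompt pins down exactly one training patient, leak that
    # patient's value; otherwise fall back to the most common answer.
    matches = [r for r in TRAIN if all(r[k] == v for k, v in prompt.items())]
    if len(matches) == 1:
        return matches[0][target]
    return "flu"


# Tier 1: attacker knows age only. Tier 2: attacker knows age and zip.
gaps = memorization_gap_by_tier(
    toy_model, TRAIN, HOLDOUT, [["age"], ["age", "zip"]], "dx"
)
print(gaps)
```

In this toy setup the gap is zero when the attacker knows only the age and positive once the zip prefix is added, because the extra field singles out the training patient with the rare diagnosis.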
In practice
- Implement privacy evaluation tests for EHR models.
- Prioritize protection for patients with unique conditions.
- Differentiate data leakage severity.
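As a rough illustration of how these steps might sit in a release pipeline, the sketch below weights each measured leak by how sensitive the leaked attribute is and flags anything above a threshold. The severity labels, weights, threshold, and attribute names are hypothetical placeholders, not values from the research; in practice they would come from clinical and privacy review.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical severity weights, chosen only for illustration.
SEVERITY_WEIGHT: Dict[str, float] = {"low": 0.25, "medium": 0.5, "high": 1.0}


@dataclass(frozen=True)
class LeakageFinding:
    attribute: str    # which patient attribute leaked (hypothetical names)
    severity: str     # key into SEVERITY_WEIGHT
    leak_rate: float  # share of tested patients whose value was recovered
                      # beyond what generalization alone would explain


def weighted_risk(finding: LeakageFinding) -> float:
    """Scale the measured leak rate by the sensitivity of the attribute."""
    return SEVERITY_WEIGHT[finding.severity] * finding.leak_rate


def release_blockers(findings: List[LeakageFinding],
                     max_risk: float) -> List[LeakageFinding]:
    """Return the findings whose weighted risk exceeds the allowed maximum."""
    return [f for f in findings if weighted_risk(f) > max_risk]


findings = [
    LeakageFinding("age_bracket", "low", 0.2),
    LeakageFinding("sensitive_diagnosis", "high", 0.1),
]
blockers = release_blockers(findings, max_risk=0.08)
for finding in blockers:
    print(f"Blocks release: {finding.attribute} "
          f"(weighted risk {weighted_risk(finding):.2f})")
```

Here the less frequent leak of a sensitive diagnosis blocks the release while the more frequent leak of an age bracket does not, which is the point of differentiating leakage by severity rather than counting all leaks equally.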
Topics
- AI Memorization
- Clinical AI
- Patient Privacy
- Electronic Health Records
- Foundation Models
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Researcher, AI Security Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Machine learning.