MIT scientists investigate memorization risk in the age of clinical AI

2026-01-05 · Source: MIT News - Machine learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Advanced, short

Summary

MIT researchers led by Sana Tonekaboni and Marzyeh Ghassemi presented research at the 2025 Conference on Neural Information Processing Systems (NeurIPS) on how AI foundation models trained on de-identified electronic health records (EHRs) can inadvertently memorize patient-specific information. Memorization, as distinct from generalization, could lead to privacy violations if an attacker prompts a model to extract sensitive data. The team developed a testing framework to evaluate leakage in a healthcare context, assessing practical risk by how much information an attacker already possesses and how sensitive the leaked data is. They found that the more an attacker knows, the more likely leakage becomes, and that patients with unique conditions are particularly vulnerable, even when the data is de-identified. The work aims to give the community practical evaluation steps to run before releasing such models.

Key takeaway

If you are a CTO or VP of Engineering overseeing AI model development in healthcare, integrate privacy evaluation tests into your model release pipeline. Your teams should distinguish true memorization from generalization and assess the practical risk of data leakage, especially for sensitive patient information and for patients with unique conditions. Prioritize safeguards that account for what an attacker may already know.

Key insights

AI models trained on EHRs can memorize patient data, necessitating rigorous testing to prevent privacy breaches.

Principles

Evaluate AI leakage in a healthcare context.
Distinguish generalization from memorization.
Assess practical risk based on attacker knowledge.

Method

The research team developed a series of tests that measure various types of uncertainty and assess practical risk to patients across different tiers of attack possibility, distinguishing model generalization from patient-level memorization.
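
As a rough illustration of the tiered idea, and not the researchers' actual framework, the sketch below compares how often a hidden field is recovered for training patients versus held-out patients as the attacker knows more fields. Everything in it is an assumption made for illustration: the toy records, the `predict` stand-in for a model, and the `memorization_gap` score.

```python
from collections import Counter

# Hypothetical toy records: (age_band, sex, diagnosis, lab_flag).
# The last field is the sensitive value the attacker tries to recover.
TRAIN = [
    ("60-69", "F", "diabetes", "normal"),
    ("60-69", "F", "diabetes", "normal"),
    ("60-69", "M", "diabetes", "normal"),
    ("30-39", "M", "rare_disorder", "high"),
    ("30-39", "F", "asthma", "normal"),
    ("30-39", "F", "asthma", "normal"),
]
HELD_OUT = [
    ("60-69", "F", "diabetes", "normal"),
    ("60-69", "M", "diabetes", "high"),
    ("30-39", "M", "rare_disorder", "normal"),
    ("30-39", "F", "asthma", "normal"),
    ("60-69", "F", "diabetes", "normal"),
    ("30-39", "F", "asthma", "normal"),
]


def predict(train, known):
    """Toy stand-in for a model (assumption): return the most common
    sensitive value among training records whose leading fields match."""
    k = len(known)
    matches = [r[-1] for r in train if r[:k] == known]
    if not matches:  # unseen prefix: fall back to the whole population
        matches = [r[-1] for r in train]
    return Counter(matches).most_common(1)[0][0]


def hit_rate(records, train, k):
    """Fraction of records whose sensitive value is recovered when the
    attacker knows the first k fields."""
    hits = sum(predict(train, r[:k]) == r[-1] for r in records)
    return hits / len(records)


def memorization_gap(train, held_out, k):
    """Recovery rate on training patients minus recovery rate on unseen
    patients. A gap near zero suggests generalization; a large positive
    gap suggests patient-level memorization."""
    return hit_rate(train, train, k) - hit_rate(held_out, train, k)


if __name__ == "__main__":
    for k in range(4):  # tier k = number of fields the attacker knows
        print(k, round(memorization_gap(TRAIN, HELD_OUT, k), 3))
```

In this toy data the gap stays at zero while the attacker knows nothing or only the age band, and it opens up once sex is also known, driven by the single rare-disorder patient in the training set. A real evaluation would query the actual model and use properly matched reference populations.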

In practice

Implement privacy evaluation tests for EHR models.
Prioritize protection for patients with unique conditions.
Differentiate leaks by the sensitivity of the data exposed.
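
One simple way to find the patients that the second item above refers to is to count how many records share the same attacker-known fields, and flag those who stand alone. This is a minimal sketch under assumed field names (`patient_id`, `age_band`, `diagnosis`) and an assumed threshold; it is a generic uniqueness check, not a method described in the source.

```python
from collections import Counter

# Hypothetical de-identified records; field names are illustrative only.
PATIENTS = [
    {"patient_id": "p1", "age_band": "60-69", "diagnosis": "diabetes"},
    {"patient_id": "p2", "age_band": "60-69", "diagnosis": "diabetes"},
    {"patient_id": "p3", "age_band": "30-39", "diagnosis": "rare_disorder"},
    {"patient_id": "p4", "age_band": "30-39", "diagnosis": "asthma"},
    {"patient_id": "p5", "age_band": "30-39", "diagnosis": "asthma"},
]


def group_sizes(records, known_fields):
    """For each record, count how many records share its values on the
    fields the attacker is assumed to know."""
    keys = [tuple(r[f] for f in known_fields) for r in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def flag_unique(records, known_fields, threshold=1):
    """Return IDs of patients whose group is no larger than `threshold`,
    i.e. patients an attacker could single out from those fields."""
    sizes = group_sizes(records, known_fields)
    return [r["patient_id"] for r, n in zip(records, sizes) if n <= threshold]


if __name__ == "__main__":
    print(flag_unique(PATIENTS, ["age_band"]))
    print(flag_unique(PATIENTS, ["age_band", "diagnosis"]))
```

Patients flagged this way are natural candidates for stricter leakage tests or additional safeguards before a model is released.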

Topics

AI Memorization
Clinical AI
Patient Privacy
Electronic Health Records
Foundation Models

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Researcher, AI Security Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Machine learning.