Representation learning to advance multi-institutional studies with electronic health record data from US and France
Summary
A new graph-based framework addresses the challenge of harmonizing fragmented and heterogeneously coded electronic health record (EHR) data across privacy-siloed institutions. This framework treats data harmonization as a scalable representation learning problem, integrating institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models. The joint learning approach creates a shared semantic space, aligning diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, this framework provides a robust, data-centric foundation for training and deploying clinical models across disparate healthcare systems. The pairwise cosine similarity data derived from the GAME embeddings are publicly available, while institution-level summary data require restricted access via data use agreements.
Key takeaway
For AI scientists and machine learning engineers developing clinical models across multiple healthcare systems, this framework addresses both data heterogeneity and privacy constraints. Its graph-based representation learning approach harmonizes site-specific EHR vocabularies, which supports more robust model training and deployment. For an initial investigation of cross-institutional alignment, start with the publicly available pairwise cosine similarity data derived from the GAME embeddings.
Key insights
A graph-based framework harmonizes heterogeneous EHR data across institutions using representation learning and privacy-preserving methods.
Principles
- Integrate diverse data sources for semantic alignment.
- Preserve patient privacy through collaborative learning.
- Treat data harmonization as a representation learning task.
Method
The framework integrates institution-specific EHR summary statistics, biomedical knowledge graphs, and large language model semantic information to learn a shared semantic space, aligning diverse vocabularies without sharing patient-level data.
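To make the idea of a shared semantic space concrete, the following minimal sketch matches local codes from two sites by cosine similarity. It is an illustration only, not the framework's implementation: the code names, the three-dimensional vectors, and the `cosine` and `nearest_code` helpers are all hypothetical, and the vectors are hand-made rather than learned.

```python
import math

# Hypothetical toy embeddings, assumed to already live in one shared space.
# In the framework such vectors would be learned jointly; these are made up.
site_a = {
    "A:LAB_GLUCOSE": [0.9, 0.1, 0.0],
    "A:DX_DIABETES": [0.1, 0.9, 0.1],
}
site_b = {
    "B:GLYCEMIE": [0.8, 0.2, 0.1],
    "B:DIABETE": [0.2, 0.8, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_code(query_vec, candidates):
    """Return the candidate code whose vector is most similar to query_vec."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))


# Map every site A code to its closest site B code in the shared space.
matches = {code: nearest_code(vec, site_b) for code, vec in site_a.items()}
print(matches)
```

Because only vectors are compared, no patient-level records are needed at this step, which is the property the shared space is meant to provide.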
In practice
- Access the publicly available pairwise cosine similarity data derived from the GAME embeddings.
- Request institution-level summary data through data use agreements (DUAs).
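A first look at the public similarity data might resemble the sketch below, which ranks the codes most similar to a query code. The file layout is an assumption: the column names `code_a`, `code_b`, and `cosine`, the placeholder codes, the similarity values, and the helpers `load_pairs` and `top_similar` are all hypothetical, so adjust them to the format of the actual release.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: one row per code pair with its cosine similarity.
# Codes and values are placeholders, not taken from the published data.
TOY_CSV = """code_a,code_b,cosine
X1,X2,0.71
X1,X3,0.83
X1,X4,0.35
X2,X3,0.10
"""


def load_pairs(text):
    """Build a symmetric lookup: code -> list of (other_code, similarity)."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        sim = float(row["cosine"])
        index[row["code_a"]].append((row["code_b"], sim))
        index[row["code_b"]].append((row["code_a"], sim))
    return index


def top_similar(index, code, k=2):
    """Return the k codes most similar to `code`, highest similarity first."""
    return sorted(index.get(code, []), key=lambda pair: pair[1], reverse=True)[:k]


index = load_pairs(TOY_CSV)
print(top_similar(index, "X1", k=2))
```

For a real file, replace `io.StringIO(text)` with an opened file handle; the ranking logic stays the same.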
Topics
- Representation Learning
- Electronic Health Records
- Data Harmonization
- Privacy-Preserving Learning
- Biomedical Knowledge Graphs
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.