Representation learning to advance multi-institutional studies with electronic health record data from US and France

2026-04-03 · Source: Machine learning : nature.com subject feeds · Field: Health & Wellbeing — Health & Medical Research, Medical Devices & Health Technology, Healthcare Systems & Policy · Depth: Expert, long

Summary

A new graph-based framework addresses the challenge of harmonizing fragmented and heterogeneously coded electronic health record (EHR) data across privacy-siloed institutions. This framework treats data harmonization as a scalable representation learning problem, integrating institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models. The joint learning approach creates a shared semantic space, aligning diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, this framework provides a robust, data-centric foundation for training and deploying clinical models across disparate healthcare systems. The pairwise cosine similarity data derived from the GAME embeddings are publicly available, while institution-level summary data require restricted access via data use agreements.

Key takeaway

For AI scientists and machine learning engineers developing clinical models across multiple healthcare systems, this framework addresses both data heterogeneity and privacy constraints. Consider integrating this graph-based representation learning approach to harmonize diverse EHR vocabularies before training and deploying models across sites. For initial investigations into cross-institutional alignment, start with the publicly available pairwise cosine similarity data derived from the GAME embeddings.

Key insights

A graph-based framework harmonizes heterogeneous EHR data across institutions using representation learning and privacy-preserving methods.

Principles

Integrate diverse data sources for semantic alignment.
Preserve patient privacy by sharing institution-level summary statistics rather than patient-level data.
Treat data harmonization as a representation learning task.

Method

The framework integrates institution-specific EHR summary statistics, biomedical knowledge graphs, and large language model semantic information to learn a shared semantic space, aligning diverse vocabularies without sharing patient-level data.
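
To make the idea of a shared semantic space concrete, the following is a minimal sketch, not the framework's implementation. It assumes each institution's codes already have vectors in a common space and aligns two vocabularies by nearest cosine similarity. The code names, vectors, and the helper functions `cosine` and `align` are all hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(site_a, site_b):
    """For each code in site_a, return the most similar code in site_b and its score."""
    mapping = {}
    for code_a, vec_a in site_a.items():
        best = max(site_b, key=lambda code_b: cosine(vec_a, site_b[code_b]))
        mapping[code_a] = (best, cosine(vec_a, site_b[best]))
    return mapping


# Hypothetical toy embeddings; real embeddings would be learned, not hand-written.
site_a = {
    "A:LOCAL_GLUCOSE": [0.9, 0.1, 0.0],
    "A:LOCAL_HBA1C": [0.1, 0.9, 0.1],
}
site_b = {
    "B:GLYCEMIE": [0.8, 0.2, 0.1],
    "B:HBA1C": [0.0, 1.0, 0.0],
}

mapping = align(site_a, site_b)
for code_a, (code_b, score) in mapping.items():
    print(f"{code_a} -> {code_b} (cosine={score:.3f})")
```

Only the vectors cross institutional boundaries in this sketch, which mirrors the point above that vocabularies can be aligned without exchanging patient-level records.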

In practice

Use the publicly available pairwise cosine similarity data derived from the GAME embeddings.
Request access to institution-level summary data through data use agreements (DUAs).
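
As a starting point for working with pairwise similarity data, here is a small sketch that ranks the nearest neighbours of a given code. The file layout of the published data is not described in this summary, so the long-format columns `code_1`, `code_2`, and `cosine`, the sample rows, and the helpers `load_pairs` and `top_k` are assumptions made purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed long format; not taken from the published data.
SAMPLE = """code_1,code_2,cosine
X1,Y1,0.91
X1,Y2,0.15
X1,Y3,0.62
X2,Y2,0.88
"""


def load_pairs(handle):
    """Read (code_1, code_2, cosine) rows into a symmetric neighbour lookup."""
    neighbors = defaultdict(list)
    for row in csv.DictReader(handle):
        score = float(row["cosine"])
        neighbors[row["code_1"]].append((row["code_2"], score))
        neighbors[row["code_2"]].append((row["code_1"], score))
    return neighbors


def top_k(neighbors, code, k=2, min_score=0.5):
    """Return up to k neighbours of `code`, highest similarity first, at or above min_score."""
    ranked = sorted(neighbors.get(code, []), key=lambda pair: pair[1], reverse=True)
    return [(other, score) for other, score in ranked[:k] if score >= min_score]


neighbors = load_pairs(io.StringIO(SAMPLE))
print(top_k(neighbors, "X1"))
```

To run this against a real file, replace `io.StringIO(SAMPLE)` with an open file handle and adjust the column names to match the actual release. The `min_score` threshold is an arbitrary illustrative value and would need tuning against known code mappings.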

Topics

Representation Learning
Electronic Health Records
Data Harmonization
Privacy-Preserving Learning
Biomedical Knowledge Graphs

Code references

celehs/GAME

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.