Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery
Summary
Pretrained biomedical language models (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76-0.92) to unrelated cross-domain pairs, which yields 0% accuracy on cross-domain discrimination. This flaw is critical for Large Behavioural Models (LBMs), which infer causal links from embedding proximity. The proposed fix is a contrastive training pass over 72,034 pairs, which raises PubMedBERT's BIOSSES correlation from 0.633 to 0.828 and its within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from a biomedical knowledge graph and increases separation further to 2.30x and the discrimination gap to +0.392, at a 4.5% cost on BIOSSES. For inference, OpenVINO on an Intel Xeon 6737P with AMX cut single-query latency from 1367 ms to about 10 ms (133x) and reached 555 sentences/sec. Notably, FP16 outperformed INT8 on this hardware. The authors release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
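To make the within-vs-across-domain separation figure concrete, the sketch below computes one plausible version of it: the mean cosine similarity of same-domain pairs divided by the mean cosine similarity of cross-domain pairs. The summary does not give the authors' exact formula, so this definition is an assumption, and the vectors and domain labels are made-up toy values, not data from the article.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_ratio(embeddings, domains):
    """Mean within-domain similarity divided by mean across-domain similarity.

    Assumed definition: the article's exact metric may differ.
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if domains[i] == domains[j]:
            within.append(sim)
        else:
            across.append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))


# Toy 2-D "embeddings" with hypothetical domain labels.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
domains = ["domain_a", "domain_a", "domain_b", "domain_b"]

ratio = separation_ratio(embeddings, domains)
print(f"separation ratio: {ratio:.2f}x")
```

Under this definition, a ratio near 1.0x means the encoder barely distinguishes same-domain from cross-domain pairs, which is the situation the summary reports for the off-the-shelf model (1.05x).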
Key takeaway
For AI scientists building causal inference systems or Large Behavioural Models, off-the-shelf biomedical encoders risk propagating false causal links, because they place unrelated cross-domain pairs close together in embedding space. Apply contrastive training, ideally with hard negatives mined from a knowledge graph, so that embedding distance reflects real relatedness. For deployment, consider OpenVINO on an Intel Xeon 6737P with AMX, and benchmark both precisions: FP16 outperformed INT8 on this hardware.
Key insights
Pretrained biomedical language models embed unrelated cross-domain pairs close together, which misleads any causal inference built on embedding proximity. Specialized contrastive training is needed to correct this.
Principles
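- When causal links are inferred from embedding proximity, embedding geometry determines whether those links are correct.
- Cross-domain discrimination needs explicit training; pretraining alone does not provide it.
- FP16 can outperform INT8 on specific silicon, as it did on the Xeon 6737P with AMX.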
Method
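A two-pass contrastive training method. The first pass trains on 72,034 pairs and raises within-vs-across-domain separation from 1.05x to 1.63x. The second pass, BODHI, mines hard negatives from a biomedical knowledge graph and raises separation to 2.30x, with a discrimination gap of +0.392.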
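The summary does not state which contrastive objective the authors use. As an illustration only, the sketch below implements an InfoNCE-style loss, a common choice for this kind of training, and shows why hard negatives matter: a negative that sits close to the anchor produces a much larger loss (and therefore a stronger training signal) than an easy one. The temperature value and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for one anchor: -log softmax of the positive's logit.

    Assumed objective; the article's actual loss is not given in the summary.
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_denominator - pos_logit


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # far from the anchor
hard_negative = [0.8, 0.6]   # close to the anchor but labeled unrelated

easy_loss = info_nce(anchor, positive, [easy_negative])
hard_loss = info_nce(anchor, positive, [hard_negative])
print(f"easy negative loss: {easy_loss:.6f}")
print(f"hard negative loss: {hard_loss:.6f}")
```

Minimizing this loss pushes the hard negative away from the anchor, which is the kind of pressure needed to widen the gap between related and unrelated pairs.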
In practice
- Use contrastive training to improve cross-domain embedding separation.
- Mine hard negatives from knowledge graphs.
- Optimize inference with OpenVINO on Intel Xeon with AMX, and benchmark FP16 against INT8.
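The summary does not describe how BODHI selects its hard negatives. One plausible heuristic, sketched below purely as an assumption, is to treat entities that share a neighbor with the anchor in the knowledge graph but have no direct edge to it as hard negatives: they are topically close, yet not linked. The graph, node names, and the `mine_hard_negatives` helper are all hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) edge pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def mine_hard_negatives(graph, anchor):
    """Return nodes two hops from the anchor that are not direct neighbors.

    Hypothetical heuristic, not the BODHI algorithm itself.
    """
    direct = graph.get(anchor, set())
    two_hop = set()
    for neighbor in direct:
        two_hop |= graph.get(neighbor, set())
    return sorted(two_hop - direct - {anchor})


# Toy knowledge graph with placeholder entity names.
edges = [
    ("gene_A", "pathway_X"),
    ("gene_B", "pathway_X"),
    ("gene_A", "disease_Y"),
    ("gene_C", "pathway_Z"),
]
graph = build_graph(edges)

hard_negatives = mine_hard_negatives(graph, "gene_A")
print(hard_negatives)
```

Each mined entity can then be paired with the anchor as a negative example in contrastive training, alongside the known positive pairs.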
Topics
- Causal Discovery
- Biomedical Language Models
- Contrastive Learning
- Embedding Separation
- OpenVINO Performance
- Large Behavioural Models
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.