Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery
Summary
Pretrained biomedical language models (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76-0.92) to causally unrelated cross-domain concepts, which yields 0% cross-domain discrimination accuracy. This is a particular problem for Large Behavioural Models (LBMs) that infer causal links from embedding proximity, because the resulting causal graphs contain spurious edges. The authors address it with a two-pass contrastive learning approach. The first pass, trained on 72,034 pairs, raised PubMedBERT's BIOSSES correlation from 0.633 to 0.828 and its within-vs-across-domain separation from 1.05x to 1.63x. The second pass, named BODHI, mines hard negatives from biomedical knowledge graphs and raises separation further to 2.30x and the discrimination gap to +0.392, at a 4.5% cost in BIOSSES. Inference with OpenVINO on an Intel Xeon 6737P with AMX reached 10 ms latency (reported as 133x faster) and 555 sentences/sec, with FP16 outperforming INT8.
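The separation figures above can be made concrete with a small sketch. The summary does not give the exact formulas, so this assumes the separation ratio is the mean within-domain cosine similarity divided by the mean cross-domain cosine similarity, and that the discrimination gap is their difference. The vectors and domain labels are hypothetical toy values, not data from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_metrics(embeddings, domains):
    """Return (ratio, gap) over all unordered pairs of embeddings.

    ratio = mean within-domain similarity / mean cross-domain similarity
    gap   = mean within-domain similarity - mean cross-domain similarity
    Both definitions are assumptions made for this sketch. The ratio is
    only meaningful when the mean cross-domain similarity is positive.
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if domains[i] == domains[j]:
            within.append(sim)
        else:
            across.append(sim)
    mean_within = sum(within) / len(within)
    mean_across = sum(across) / len(across)
    return mean_within / mean_across, mean_within - mean_across


# Hypothetical 2-D embeddings from two made-up domains.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
domains = ["domain_a", "domain_a", "domain_b", "domain_b"]

ratio, gap = separation_metrics(embeddings, domains)
print(f"separation ratio: {ratio:.2f}x, discrimination gap: {gap:+.3f}")
```

Under these definitions, a ratio close to 1.0x (such as the 1.05x baseline) means the encoder scores cross-domain pairs almost as high as within-domain pairs.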
Key takeaway
For AI Scientists developing Large Behavioural Models or causal discovery systems, accurately distinguishing causally unrelated concepts in embeddings is paramount. Your models may be inferring false causal links from high cosine similarity in off-the-shelf biomedical encoders. Implement contrastive learning, especially with hard negative mining, to significantly improve cross-domain separation. Additionally, optimize inference with OpenVINO on AMX-enabled Intel Xeon processors, noting that FP16 may yield better performance than INT8 for your specific deployment.
Key insights
Biomedical language models require explicit causal embedding separation to prevent false causal links in LBMs.
Principles
- Embedding proximity is a trustworthy signal for causal discovery only when the encoder separates causally unrelated concepts.
- Hard negative mining improves cross-domain separation.
- FP16 can outperform INT8 on some hardware, as reported here for an AMX-enabled Intel Xeon 6737P.
Method
A two-pass contrastive learning approach: first, a general pass over 72,034 pairs; second, BODHI, which mines hard negatives from biomedical knowledge graphs to enhance cross-domain separation.
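A minimal sketch of the hard-negative idea follows. The summary does not state the loss function or the mining procedure, so this assumes an InfoNCE-style contrastive loss and a toy `related` dictionary standing in for a knowledge graph. All concept names, vectors, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negative(anchor_id, embeddings, related):
    """Pick the concept most similar to the anchor that the graph does NOT link to it."""
    unrelated = [
        c for c in embeddings
        if c != anchor_id and c not in related.get(anchor_id, set())
    ]
    return max(unrelated, key=lambda c: cosine(embeddings[anchor_id], embeddings[c]))


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# Hypothetical embeddings: "near_unrelated" sits close to the anchor
# even though the toy graph records no relation between them.
embeddings = {
    "anchor": [1.0, 0.2],
    "paraphrase": [0.95, 0.25],
    "near_unrelated": [0.9, 0.4],
    "far_unrelated": [0.1, 1.0],
}
related = {"anchor": {"paraphrase"}}  # stand-in for knowledge-graph edges

hard = mine_hard_negative("anchor", embeddings, related)
hard_loss = info_nce(embeddings["anchor"], embeddings["paraphrase"], [embeddings[hard]])
easy_loss = info_nce(
    embeddings["anchor"], embeddings["paraphrase"], [embeddings["far_unrelated"]]
)
print(hard, hard_loss, easy_loss)
```

The hard negative produces a much larger loss than the easy one, so training on it pushes the encoder to separate exactly the pairs it currently confuses.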
In practice
- Apply contrastive learning to improve embedding separation.
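- Use OpenVINO to accelerate inference on AMX-enabled CPUs; the reported result on an Intel Xeon 6737P is 10 ms latency (133x faster) and 555 sentences/sec.
- Benchmark FP16 against INT8 on your target hardware, since FP16 outperformed INT8 in the reported setup.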
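Because the FP16-versus-INT8 result depends on the hardware, it helps to measure both on the deployment machine. The harness below is a sketch using only the standard library. The two `fake_*_encode` functions are hypothetical stand-ins, and in a real comparison each would wrap a compiled model at the corresponding precision.

```python
import statistics
import time


def benchmark(encode, sentences, warmup=3, runs=20):
    """Time encode(sentences) and report median latency and throughput."""
    for _ in range(warmup):  # warm caches before timing
        encode(sentences)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        encode(sentences)
        latencies.append(time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    median = max(statistics.median(latencies), 1e-12)
    return {
        "median_latency_ms": median * 1000.0,
        "sentences_per_sec": len(sentences) / median,
    }


def fake_fp16_encode(batch):
    """Hypothetical placeholder for an FP16 model's encode call."""
    return [[float(len(s))] for s in batch]


def fake_int8_encode(batch):
    """Hypothetical placeholder for an INT8 model's encode call."""
    return [[float(len(s.split()))] for s in batch]


sentences = [f"example sentence number {i}" for i in range(32)]
results = {
    name: benchmark(fn, sentences)
    for name, fn in [("fp16", fake_fp16_encode), ("int8", fake_int8_encode)]
}
fastest = max(results, key=lambda name: results[name]["sentences_per_sec"])
print(results, "fastest:", fastest)
```

Throughput is only half of the decision: it is worth checking embedding quality (for example, the separation metrics) at each precision before picking the faster one.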
Topics
- Causal Discovery
- Biomedical Language Models
- Contrastive Learning
- Embedding Separation
- Large Behavioural Models
- OpenVINO Inference
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.