Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

Pretrained biomedical language models (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76-0.92) to unrelated cross-domain pairs and score 0% accuracy on cross-domain discrimination. This matters for Large Behavioural Models (LBMs), which infer causal links from embedding proximity. The proposed fix is a contrastive training pass over 72,034 pairs, which raises PubMedBERT's BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from a biomedical knowledge graph, raising separation further to 2.30x and the discrimination gap to +0.392, at a cost of 4.5% on BIOSSES. Inference optimization with OpenVINO on an Intel Xeon 6737P with AMX cut single-query latency from 1367 ms to 10 ms (reported as 133x) and reached a throughput of 555 sentences/sec. Notably, FP16 outperformed INT8 on this hardware. The authors release the benchmark suite, training corpora, the BODHI generator, and OpenVINO scripts.
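To make the within-vs-across-domain separation figure concrete, here is a minimal sketch. It assumes that separation is the mean within-domain cosine similarity divided by the mean across-domain cosine similarity; the summary reports the ratio but not its formula, so this definition is an assumption. The function names and the toy vectors are illustrative and do not come from the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def separation_ratio(embeddings, domains):
    """Mean within-domain similarity divided by mean across-domain similarity.

    Assumed definition: the source gives the ratio but not how it is computed.
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        # Pairs sharing a domain label count as "within", all others as "across".
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))

# Toy 2-D embeddings (hypothetical): two items per domain.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8], [0.6, 0.8]]
toy_domains = ["cardiology", "cardiology", "sleep", "sleep"]

ratio = separation_ratio(toy_embeddings, toy_domains)
print(f"separation: {ratio:.2f}x")
```

Under this definition, a ratio near 1.0x means the encoder treats unrelated cross-domain pairs almost the same as related in-domain pairs, which is the failure the summary describes for the untuned models.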

Key takeaway

For AI scientists building causal inference systems or Large Behavioural Models, off-the-shelf biomedical encoders risk propagating false causal links because they place unrelated cross-domain concepts close together in embedding space. Apply contrastive training, ideally with hard negatives mined from a knowledge graph, to separate the embeddings. When optimizing inference with OpenVINO on an Intel Xeon 6737P with AMX, benchmark FP16 against INT8 rather than assuming INT8 wins: FP16 outperformed it on this hardware.

Key insights

Pretrained biomedical language models place unrelated cross-domain concepts close together in embedding space, which undermines cross-domain causal inference; correcting this requires dedicated contrastive training.

Principles

Embedding geometry is critical for causal correctness.
Cross-domain discrimination needs explicit training.
FP16 can outperform INT8 on specific silicon.

Method

A two-pass contrastive training method, including BODHI for mining hard negatives from knowledge graphs, significantly improves cross-domain embedding separation and causal discrimination.
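The effect of hard negatives on a contrastive objective can be sketched with an InfoNCE-style loss. The summary does not state which loss the authors use, so InfoNCE, the temperature value, and the similarity numbers below are all assumptions chosen for illustration.

```python
import math

def info_nce_loss(sim_positive, sim_negatives, temperature=0.05):
    """InfoNCE loss for one anchor, given precomputed cosine similarities.

    sim_positive: similarity between the anchor and its positive example.
    sim_negatives: similarities between the anchor and each negative example.
    """
    logits = [sim_positive / temperature] + [s / temperature for s in sim_negatives]
    max_logit = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - logits[0]

# Easy negatives are already far from the anchor; hard negatives sit close to it.
easy_loss = info_nce_loss(0.9, [0.1, 0.2])
hard_loss = info_nce_loss(0.9, [0.85, 0.8])
print(f"easy negatives: {easy_loss:.6f}, hard negatives: {hard_loss:.6f}")
```

With easy negatives the loss is close to zero and gives little training signal. Hard negatives produce a larger loss, which is the usual motivation for mining them.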

In practice

Use contrastive training to improve cross-domain embedding separation.
Mine hard negatives from knowledge graphs.
Optimize inference with OpenVINO on Intel Xeon with AMX.
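One way to mine hard negatives from a knowledge graph is sketched below. It treats concepts that are far apart in the graph but close in embedding space as hard negatives. This is an assumed heuristic, not the BODHI procedure, which the summary does not describe; the graph, the embeddings, the hop threshold, and all function names are hypothetical.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nodes_within_hops(graph, start, max_hops):
    """Breadth-first search: every node reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

def mine_hard_negatives(anchor, graph, embeddings, max_hops=2, top_k=1):
    """Return graph-unrelated concepts ranked by embedding similarity to the anchor."""
    related = nodes_within_hops(graph, anchor, max_hops)
    candidates = [name for name in embeddings if name not in related]
    candidates.sort(key=lambda name: cosine(embeddings[anchor], embeddings[name]), reverse=True)
    return candidates[:top_k]

# Hypothetical toy knowledge graph (adjacency lists) and 2-D embeddings.
toy_graph = {
    "insulin": ["glucose"],
    "glucose": ["insulin", "diabetes"],
    "diabetes": ["glucose"],
    "sleep_latency": ["melatonin"],
    "melatonin": ["sleep_latency"],
}
toy_embeddings = {
    "insulin": [1.0, 0.0],
    "glucose": [0.9, 0.1],
    "diabetes": [0.8, 0.2],
    "sleep_latency": [0.95, 0.05],  # near "insulin" in embedding space, unrelated in the graph
    "melatonin": [0.0, 1.0],
}

hard_negatives = mine_hard_negatives("insulin", toy_graph, toy_embeddings)
print(hard_negatives)
```

In this toy setup "sleep_latency" is selected because the encoder places it next to "insulin" even though the graph offers no path between them within the hop limit. Pairs like this are the ones a contrastive pass would need to push apart.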

Topics

Causal Discovery
Biomedical Language Models
Contrastive Learning
Embedding Separation
OpenVINO Performance
Large Behavioural Models

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.