Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

2026-06-08 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

Pretrained biomedical language models (BioBERT, PubMedBERT, BioM-ELECTRA) incorrectly assign high cosine similarity (0.76-0.92) to causally unrelated cross-domain concepts, resulting in 0% cross-domain discrimination accuracy. This issue is particularly problematic for Large Behavioural Models (LBMs) that rely on embedding proximity to infer causal links, leading to erroneous causal graphs. Researchers developed a two-pass contrastive learning approach to address this. The first pass, using 72,034 pairs, improved PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, named BODHI, mines hard negatives from biomedical knowledge graphs, further boosting separation to 2.30x and the discrimination gap to +0.392, with a 4.5% BIOSSES cost. Inference on an Intel Xeon 6737P with AMX achieved 10 ms latency (133x faster) and 555 sentences/sec using OpenVINO, with FP16 outperforming INT8.
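The summary does not define how within-vs-across-domain separation is computed. One plausible reading, sketched below as an assumption rather than the paper's metric, is the mean cosine similarity of same-domain pairs divided by the mean cosine similarity of cross-domain pairs. The function names, the toy vectors, and the domain labels are all illustrative and not taken from the source.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_ratio(embeddings, domains):
    """Mean within-domain similarity divided by mean across-domain similarity.

    ASSUMPTION: illustrative definition only; the source does not state
    how its separation figure is calculated.
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        # Same domain label -> within-domain pair, otherwise across-domain.
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))


# Toy 2-D vectors made up for illustration; not output from any model.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
toy_domains = ["cardiology", "cardiology", "nutrition", "nutrition"]
ratio = separation_ratio(toy_embeddings, toy_domains)
print(f"separation ratio: {ratio:.2f}x")
```

Under this reading, a ratio near 1.0x means the encoder treats cross-domain pairs as almost as similar as same-domain pairs, which is the failure mode described above.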

Key takeaway

For AI Scientists developing Large Behavioural Models or causal discovery systems, accurately distinguishing causally unrelated concepts in embeddings is paramount. Your models may be inferring false causal links from high cosine similarity in off-the-shelf biomedical encoders. Implement contrastive learning, especially with hard negative mining, to significantly improve cross-domain separation. Additionally, optimize inference with OpenVINO on AMX-enabled Intel Xeon processors, noting that FP16 may yield better performance than INT8 for your specific deployment.

Key insights

Biomedical language models need embeddings explicitly trained to separate causally unrelated concepts; without that separation, LBMs infer false causal links.

Principles

Causal discovery that relies on embedding proximity is only as reliable as the encoder's ability to separate unrelated concepts.
Hard negative mining improves cross-domain separation.
FP16 can outperform INT8 on specific silicon.

Method

A two-pass contrastive learning approach: first, a general pass over 72,034 pairs; second, BODHI, which mines hard negatives from biomedical knowledge graphs to enhance cross-domain separation.
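The summary does not state which contrastive objective or mining procedure the authors use. As a minimal sketch of the general idea, the block below uses an InfoNCE-style loss (a common choice, swapped in here as an assumption) and picks hard negatives as the unrelated candidates that sit closest to the anchor. The helper names, the temperature, and the toy vectors are hypothetical; in a real pipeline the candidate pool would come from a knowledge graph and the vectors from the encoder being trained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mine_hard_negatives(anchor, candidates, k):
    """Return the k candidates most similar to the anchor.

    `candidates` is a list of (label, vector) pairs assumed to be already
    known as unrelated to the anchor; the most similar ones are the
    hardest negatives because the encoder currently confuses them.
    """
    ranked = sorted(candidates, key=lambda c: cosine(anchor, c[1]), reverse=True)
    return ranked[:k]


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss for one anchor: -log softmax of the positive's logit."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy vectors made up for illustration; not output from any model.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
pool = [("far", [0.0, 1.0]), ("near", [0.8, 0.6]), ("mid", [0.5, 0.9])]

hard = mine_hard_negatives(anchor, pool, k=1)
hard_loss = info_nce(anchor, positive, [vec for _, vec in hard])
easy_loss = info_nce(anchor, positive, [pool[0][1]])
print(hard[0][0], hard_loss, easy_loss)
```

The hard negative produces a larger loss than the easy one, so it contributes a stronger training signal for pushing apart the pairs the encoder currently confuses.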

In practice

Apply contrastive learning to improve embedding separation.
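Use OpenVINO on AMX-enabled CPUs; the reported 133x speedup (10 ms latency, 555 sentences/sec) was measured on an Intel Xeon 6737P.
Benchmark FP16 against INT8 on the target hardware; FP16 outperformed INT8 in the reported setup.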
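Because the precision and latency results are hardware-specific, it helps to measure them on the deployment machine. The harness below is a generic sketch for timing any encoder callable; `stand_in_encoder` is a hypothetical placeholder, and in practice it would be replaced by a wrapper around each compiled model variant (for example an FP16 and an INT8 build) so the two can be compared under the same conditions.

```python
import statistics
import time


def benchmark(encode, sentences, warmup=2):
    """Return (median single-sentence latency in ms, batch throughput in sentences/sec)."""
    # Warm-up calls so one-time setup costs do not skew the measurements.
    for _ in range(warmup):
        encode(sentences[:1])

    # Latency: time one sentence per call and take the median.
    latencies_ms = []
    for sentence in sentences:
        start = time.perf_counter()
        encode([sentence])
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # Throughput: time a single call over the whole batch.
    start = time.perf_counter()
    encode(sentences)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return statistics.median(latencies_ms), len(sentences) / elapsed


def stand_in_encoder(batch):
    """Hypothetical placeholder encoder; not a real model."""
    return [[float(len(s)), float(s.count(" "))] for s in batch]


sample = [
    "Statins reduce LDL cholesterol.",
    "Sleep deprivation impairs glucose tolerance.",
    "Exercise increases insulin sensitivity.",
]
latency_ms, throughput = benchmark(stand_in_encoder, sample)
print(f"median latency: {latency_ms:.3f} ms, throughput: {throughput:.0f} sentences/sec")
```

Running the same harness over each precision variant gives a like-for-like comparison instead of assuming that the lower-precision build is faster.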

Topics

Causal Discovery
Biomedical Language Models
Contrastive Learning
Embedding Separation
Large Behavioural Models
OpenVINO Inference

Code references

kougioulis/LCM-paper

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.