The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models
Summary
This study introduces a residualization-and-permutation diagnostic for interpreting genomic foundation model outputs on the "dark regulome" in high-grade gliomas. Applied to Caduceus-Ph, HyenaDNA, and Enformer across 30,448 dark genome elements at 92 glioma-relevant loci, the diagnostic separates sequence predictability from true regulatory influence. It reveals a consistent 10 kb proximal-regulatory horizon and a clear architectural split: the language models (Caduceus-Ph, HyenaDNA) share a predictability layer that ranks long transposable elements highly, while Enformer alone identifies a regulatory-output layer of short proximal cCREs, with zero top-100 overlap between the two layers. The analysis also confirms a 3.3x enrichment (p_emp < 5x10^-3) for brain cis-eQTLs among top-100 elements, providing validated candidates at synaptogenic loci.
Key takeaway
Research scientists interpreting genomic foundation model outputs should apply the residualization-and-permutation diagnostic to avoid mistaking sequence predictability for true regulatory function. The diagnostic helps identify genuinely regulatory elements, especially those within the robust 10 kb proximal-regulatory horizon. Focus experimental validation on candidates that show regulatory-output layer signals, such as those uniquely identified by Enformer, to ensure biological relevance.
Key insights
A diagnostic separates sequence predictability from true regulatory influence in genomic foundation models, revealing distinct functional layers.
Principles
- In-silico mutagenesis (ISM) likelihood scores conflate regulatory function with sequence predictability.
- Cross-architecture agreement alone does not isolate true regulatory signals.
- A 10 kb proximal-regulatory horizon is a robust signal in genomic models.
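The cross-architecture point above can be checked with a simple top-k overlap count between two models' element rankings. The sketch below is illustrative only: the function name `top_k_overlap`, the convention that higher scores rank first, and the toy score vectors are assumptions, not details from the original article.

```python
import numpy as np

def top_k_overlap(scores_a, scores_b, k=100):
    """Count elements that appear in the top-k of both score vectors.

    Assumes both vectors score the same elements in the same order,
    and that a higher score means a higher rank.
    """
    top_a = set(np.argsort(scores_a)[::-1][:k].tolist())
    top_b = set(np.argsort(scores_b)[::-1][:k].tolist())
    return len(top_a & top_b)

# Toy data (hypothetical): two models that rank elements in opposite order,
# and one model compared against itself.
model_a = np.arange(1000, dtype=float)
model_b = model_a[::-1].copy()

disjoint = top_k_overlap(model_a, model_b, k=100)
identical = top_k_overlap(model_a, model_a, k=100)
```

A high overlap between two models only shows that they agree; it does not show that the shared ranking reflects regulation rather than sequence predictability.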
Method
The residualization-and-permutation diagnostic separates predictability-driven from regulation-driven RIS variance by controlling for nuisance covariates (k-mer entropy, GC content, log element length, log TSS distance) and evaluating against permutation nulls.
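A minimal sketch of the general residualize-then-permute idea follows, assuming ordinary least squares for the residualization and a difference-in-means statistic for the permutation test. Neither choice is confirmed by the summary, and the function names, the annotation mask, and the synthetic data are all hypothetical.

```python
import numpy as np

def residualize(scores, covariates):
    """Remove the part of `scores` linearly explained by the nuisance covariates."""
    design = np.column_stack([np.ones(len(scores)), covariates])
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return scores - design @ coef

def permutation_p(residuals, is_annotated, n_perm=1000, seed=0):
    """One-sided empirical p-value for higher residual scores in annotated elements.

    The statistic (difference in mean residual) is an assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    mask = np.asarray(is_annotated, dtype=bool)

    def stat(m):
        return residuals[m].mean() - residuals[~m].mean()

    observed = stat(mask)
    # Null distribution: shuffle the annotation labels across elements.
    null = np.array([stat(rng.permutation(mask)) for _ in range(n_perm)])
    # Add-one correction keeps the empirical p-value strictly above zero.
    p_emp = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_emp

# Synthetic example (not data from the study): four stand-in covariates for
# k-mer entropy, GC content, log element length, and log TSS distance.
rng = np.random.default_rng(42)
n = 500
covariates = rng.normal(size=(n, 4))
is_annotated = np.zeros(n, dtype=bool)
is_annotated[:50] = True  # hypothetical "regulatory" elements
scores = (covariates @ np.array([0.8, -0.5, 1.2, -0.3])
          + 1.0 * is_annotated
          + rng.normal(size=n))

residuals = residualize(scores, covariates)
observed, p_emp = permutation_p(residuals, is_annotated)
```

Variance that survives residualization and beats the permutation null cannot be attributed to the listed covariates alone, which is the sense in which the diagnostic separates regulation-driven from predictability-driven signal.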
In practice
- Apply the diagnostic to separate predictability from regulation.
- Prioritize 10 kb proximal elements for experimental validation.
Topics
- Genomic Foundation Models
- In-silico Mutagenesis
- Dark Regulome
- Glioma Biology
- Gene Regulation
- Noncoding DNA
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.