Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction

2026-06-06 · Source: Machine Learning · Field: Science & Research — Life Sciences & Biology, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

R3LM is a framework for predicting the activity of DNA cis-regulatory elements (CREs) that combines large language models (LLMs) with structured biological knowledge. To address the limitations of black-box regression methods and of applying LLMs directly to raw DNA, R3LM introduces a biologically grounded data format that improves LLM understanding of regulatory information. It also presents CRE-ReasonBench, the first dataset linking DNA sequences and activity scores with mechanistic reasoning traces. With a two-stage training process (first teaching the LLM to reason over structured biological information, then performing regression), R3LM achieves state-of-the-art performance in enhancer prediction across three cell types. It surpasses both LLMs given raw sequence input and specialized DNA models, offers interpretable mechanistic explanations, and could serve biologists as an interpretable reward model in CRE design.

Key takeaway

For biologists and AI scientists developing models for gene regulation or designing cis-regulatory elements, R3LM offers a path to predictions that are both accurate and interpretable. Consider adopting structured biological knowledge and two-stage LLM training to move beyond black-box models. The mechanistic explanations this approach produces can help you understand why a sequence is predicted to be active and refine CRE designs accordingly.

Key insights

Integrating structured biological knowledge with LLMs via two-stage training improves interpretable DNA regulatory activity prediction.

Principles

Structured biological data enhances LLM understanding.
Mechanistic reasoning improves prediction interpretability.
Two-stage training optimizes an LLM for specific tasks.

Method

R3LM uses a biologically grounded data format and CRE-ReasonBench for two-stage LLM training: first, reasoning over structured biological information, then performing regression for DNA regulatory activity prediction.
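
The two stages described above can be pictured as two views of the same record: one supervised on a reasoning trace, one supervised on a numeric score. The following is a minimal sketch under stated assumptions, not the R3LM implementation. The `CreRecord` class, its field names, the prompt wording, and both helper functions are hypothetical and do not reflect the actual CRE-ReasonBench schema.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class CreRecord:
    """Hypothetical record layout; not the actual CRE-ReasonBench schema."""
    cell_type: str
    structured_features: str  # text rendering of biological annotations
    reasoning_trace: str      # mechanistic explanation used as a stage-1 target
    activity: float           # measured activity score used as a stage-2 target


def _base_prompt(rec: CreRecord) -> str:
    # Both stages share the same structured input; only the task line differs.
    return f"Cell type: {rec.cell_type}\nFeatures:\n{rec.structured_features}\n"


def stage1_example(rec: CreRecord) -> Dict[str, str]:
    """Stage 1 (assumed form): learn to produce a reasoning trace."""
    return {
        "prompt": _base_prompt(rec) + "Task: explain the likely regulatory activity.",
        "target": rec.reasoning_trace,
    }


def stage2_example(rec: CreRecord) -> Dict[str, Union[str, float]]:
    """Stage 2 (assumed form): learn to regress the activity score."""
    return {
        "prompt": _base_prompt(rec) + "Task: predict the activity score.",
        "target": float(rec.activity),
    }
```

Keeping the shared input in one helper makes it explicit that the second stage sees the same structured information the model learned to reason over in the first stage.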

In practice

Design interpretable reward models.
Predict enhancer activity across cell types.
Structure DNA data for LLM input.
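
To make "structure DNA data for LLM input" concrete, here is a minimal sketch of turning a raw sequence into annotated text. It is an illustration only: the summary does not specify R3LM's data format, so the motif table, the feature names, and the output layout below are all assumptions. The two motifs are simplified toy patterns, not a curated motif database.

```python
from typing import Dict, List, Tuple

# Toy motif table (assumption): simplified exact-match patterns for illustration.
TOY_MOTIFS: Dict[str, str] = {
    "GATA_like": "AGATAA",
    "TATA_like": "TATAAA",
}


def find_motifs(seq: str, motifs: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (motif_name, 0-based start) for every exact match, sorted by position."""
    seq = seq.upper()
    hits: List[Tuple[str, int]] = []
    for name, pattern in motifs.items():
        start = seq.find(pattern)
        while start != -1:
            hits.append((name, start))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits, key=lambda hit: hit[1])


def gc_content(seq: str) -> float:
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def to_structured_text(seq: str, cell_type: str,
                       motifs: Dict[str, str] = TOY_MOTIFS) -> str:
    """Render a sequence as line-oriented text an LLM can read (hypothetical layout)."""
    lines = [
        f"cell_type: {cell_type}",
        f"length: {len(seq)}",
        f"gc_content: {gc_content(seq):.2f}",
    ]
    for name, pos in find_motifs(seq, motifs):
        lines.append(f"motif: {name} at position {pos}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_structured_text("CCAGATAAGGTATAAACC", "example_cell_type"))
```

A real pipeline would likely replace the exact-match scan with position weight matrices or an established motif scanner; the point here is only that the model receives named, positioned features rather than a bare string of bases.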

Topics

DNA Regulatory Elements
Gene Expression Prediction
Large Language Models
Biological Knowledge Integration
Interpretable AI
Enhancer Prediction

Code references

DuanYi516/R3LM

Best for: AI Scientist, Research Scientist, Domain Expert

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.