[R] Genomic Large Language Models

Source: Machine Learning · Field: Science & Research — Life Sciences & Biology, Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Intermediate, quick

Summary

Evo2, Arc Institute's genomic foundation model trained on 9.3 trillion nucleotides, was tested for its ability to capture biological relationships beyond simple sequence similarity. Researchers extracted embeddings from Evo2's intermediate layers for 512 bp windows across 25 human genes and compared the resulting similarities against BLAST, a standard sequence alignment tool. Many of the strongest matches were initially driven by common repeat elements such as Alu. After stricter filtering, one pairing stood out: a section of the VIM gene (chr10) and a section of the DES gene (chr2) showed high embedding similarity (cosine = 0.948) despite having no detectable sequence alignment. Both regions are active promoters in muscle and connective tissue, share regulatory proteins, and belong to related, co-expressed genes, which suggests that Evo2 may recognize gene regulation patterns.

Key takeaway

For AI scientists exploring genomic data, Evo2 shows a promising ability to identify functional relationships in DNA that traditional sequence alignment tools miss. The signal currently requires heavy filtering to separate it from noise such as repeat elements. Even so, your team should consider integrating genomic LLM embeddings into early-stage research to surface non-obvious regulatory connections between genes, which could accelerate discovery in gene function and disease mechanisms.

Key insights

Genomic LLMs can identify functional similarities in DNA that sequence alignment does not detect, suggesting that the model has learned regulatory patterns.

Principles

Method

Extract embeddings from a genomic LLM for fixed-size DNA windows, compute the pairwise similarity between those embeddings (for example, cosine similarity), and flag high-similarity pairs that traditional sequence alignment tools such as BLAST do not align.
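The comparison step of this method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original analysis: the embeddings are toy vectors standing in for per-window Evo2 outputs (no Evo2 API is called), `aligned_pairs` is a hypothetical set of window pairs that an alignment tool such as BLAST already reported, and the 0.9 threshold and the helper names `split_windows`, `cosine_matrix`, and `candidate_pairs` are chosen only for illustration.

```python
import numpy as np


def split_windows(seq, size=512):
    """Split a DNA string into non-overlapping windows of `size` bases.

    A trailing fragment shorter than `size` is dropped.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def cosine_matrix(emb):
    """Pairwise cosine similarity between the rows of `emb`."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def candidate_pairs(emb, genes, aligned_pairs, threshold=0.9):
    """Return (i, j, cosine) for window pairs that come from different genes,
    have no alignment hit, and reach `threshold`, best match first."""
    sim = cosine_matrix(emb)
    found = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if genes[i] == genes[j]:
                continue  # only cross-gene pairs are of interest
            if (i, j) in aligned_pairs or (j, i) in aligned_pairs:
                continue  # already explained by sequence alignment
            if sim[i, j] >= threshold:
                found.append((i, j, float(sim[i, j])))
    return sorted(found, key=lambda t: -t[2])


# Toy stand-ins: one embedding row per window (assumption, not real model output).
emb = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.05, 0.0],
])
genes = ["VIM", "DES", "DES", "VIM"]  # gene label for each window
aligned_pairs = {(1, 3)}              # hypothetical alignment hits

windows = split_windows("ACGT" * 300)  # 1200 bases -> two 512-base windows
pairs = candidate_pairs(emb, genes, aligned_pairs, threshold=0.9)
```

In a real run, the rows of `emb` would come from the model's intermediate layers and `aligned_pairs` from a separate alignment pass. As the summary notes, repeat elements such as Alu would also need to be filtered out before any remaining pair is interpreted.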

Best for: AI Scientist, AI Researcher, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.