[R] Genomic Large Language Models
Summary
Evo2, Arc Institute's genomic foundation model trained on 9.3 trillion nucleotides, was tested for its ability to capture biological relationships beyond simple sequence similarity. Researchers extracted embeddings from Evo2's intermediate layers for 512 bp windows across 25 human genes and compared the resulting similarities against BLAST, a standard sequence alignment tool. Many strong matches were initially driven by common repeat elements such as Alu, but stricter filtering revealed a notable pairing: a section of the VIM gene (chr10) and a section of the DES gene (chr2) showed high embedding similarity (cosine = 0.948) despite lacking detectable sequence alignment. Both regions are active promoters in muscle and connective tissue, share regulatory proteins, and originate from related, co-expressed genes, suggesting Evo2 can recognize gene regulation patterns.
Key takeaway
For AI Scientists exploring genomic data, Evo2 shows a promising ability to identify functional relationships in DNA that traditional sequence alignment tools miss. The signal currently requires heavy filtering to separate it from noise, but your team should consider integrating genomic LLM embeddings into early-stage research to uncover non-obvious regulatory connections between genes, potentially accelerating discovery in gene function and disease mechanisms.
Key insights
Genomic LLMs can identify functional similarities in DNA that sequence alignment does not detect, which suggests they have learned regulatory patterns.
Principles
- Embeddings can capture functional similarity.
- Sequence alignment detects raw DNA similarity.
Method
Extract embeddings from a genomic LLM for DNA windows, then compare these embeddings to identify similarities not detectable by traditional sequence alignment tools like BLAST.
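The comparison step can be sketched as follows. This is a minimal illustration, not the researchers' pipeline: it assumes each 512 bp window has already been reduced to one fixed-length vector (how Evo2's intermediate-layer activations are pooled into that vector is not described in the summary), and it uses random placeholder arrays in place of real embeddings. The helper names `tile_windows`, `cosine_matrix`, and `top_pairs` are hypothetical.

```python
import numpy as np


def tile_windows(seq, size=512, step=512):
    """Split a DNA string into fixed-size windows, dropping a trailing partial window."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def top_pairs(sim, k=5):
    """Return (row, col, score) for the k highest-scoring cells, best first."""
    flat = np.argsort(sim, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, sim.shape)
    return [(int(r), int(c), float(sim[r, c])) for r, c in zip(rows, cols)]


# Placeholder data (assumption): one 64-dimensional vector per window.
# Real vectors would come from the genomic model; these random values
# only exercise the code, so the scores carry no biological meaning.
rng = np.random.default_rng(0)
gene_a = rng.normal(size=(4, 64))  # 4 windows from one gene
gene_b = rng.normal(size=(6, 64))  # 6 windows from another gene

sim = cosine_matrix(gene_a, gene_b)
best = top_pairs(sim, k=3)
```

Each entry in `best` names a window pair worth checking against an alignment tool; pairs that score high here but produce no alignment are the candidates of interest.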
In practice
- Use Evo2 to find regulatory patterns.
- Filter for meaningful biological signals.
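A rough sketch of the filtering step, under stated assumptions: each candidate pair already carries its cosine score, a flag for whether an alignment tool reported a hit, and a flag for whether either window overlaps an annotated repeat such as Alu. The `CandidatePair` record, the `filter_pairs` helper, and the 0.9 threshold are all hypothetical. Only the VIM/DES score of 0.948 comes from the summary; the other rows are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidatePair:
    window_a: str
    window_b: str
    cosine: float
    has_alignment_hit: bool   # an alignment tool such as BLAST found a match
    overlaps_repeat: bool     # either window overlaps a repeat element (e.g. Alu)


def filter_pairs(pairs, min_cosine=0.9):
    """Keep high-similarity pairs that alignment and repeat content cannot explain."""
    kept = [
        p for p in pairs
        if p.cosine >= min_cosine
        and not p.has_alignment_hit
        and not p.overlaps_repeat
    ]
    return sorted(kept, key=lambda p: p.cosine, reverse=True)


pairs = [
    # Score reported in the summary for the VIM/DES promoter windows.
    CandidatePair("VIM_window", "DES_window", 0.948, False, False),
    # Invented rows (assumption): a repeat-driven match and a weak match.
    CandidatePair("example_w1", "example_w2", 0.970, True, True),
    CandidatePair("example_w3", "example_w4", 0.610, False, False),
]

survivors = filter_pairs(pairs)
```

In this toy input only the VIM/DES pair survives, mirroring the summary's point that repeat-driven matches dominate until they are filtered out.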
Topics
- Genomic Large Language Models
- Evo2 Model
- Gene Regulation
- Biological Embeddings
- Sequence Alignment
Best for: AI Scientist, AI Researcher, Research Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.