Large genome model: Open source AI trained on trillions of bases
Summary
Evo 2 is an open-source AI system trained on 8.8 trillion base pairs of DNA from all three domains of life (bacteria, archaea, and eukaryotes), as well as from viruses that infect bacteria. Built on a convolutional neural network called StripedHyena 2, it can identify genomic features such as regulatory DNA, splice sites, and protein-coding regions, even in intricate eukaryotic genomes. Training ran in two stages: first on 8,000-base chunks for feature identification, then on million-base sequences for large-scale features. Two versions were released: a 7-billion-parameter model trained on 2.4 trillion bases and a 40-billion-parameter model trained on the full dataset. Evo 2 supports zero-shot prediction: without task-specific training, it recognizes mutations and their severity and adapts to the different genetic codes used across species.
Key takeaway
For AI scientists and bioinformaticians working with complex genomic data, Evo 2 offers a powerful, open-source tool for preliminary genome annotation and feature identification. Because it performs zero-shot prediction across diverse genomes, including eukaryotes, you can explore subtle patterns and potentially discover novel genomic elements without task-specific fine-tuning. Consider integrating Evo 2 into your analysis pipelines to speed up both the interpretation of newly sequenced genomes and the assessment of mutation impact.
Key insights
Evo 2 is an open-source AI trained on trillions of DNA bases to identify complex genomic features across all three domains of life.
Principles
- Evolutionary conservation indicates functional importance.
- Neural networks excel at subtle pattern recognition.
- Zero-shot prediction avoids bias from fine-tuning.
Method
Evo 2 is built on StripedHyena 2 and trained in two stages on a dataset of 8.8 trillion bases: first on 8,000-base chunks for feature identification, then on million-base sequences for large-scale genomic features.
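The two context lengths in this method can be illustrated with a minimal Python sketch that splits a sequence into fixed-size windows. It is an illustration only, not Evo 2's actual data pipeline: the function `chunk_sequence`, the `STAGE_WINDOWS` names, and the choice of non-overlapping windows are all assumptions introduced here.

```python
from typing import Iterator, Optional


def chunk_sequence(seq: str, window: int, stride: Optional[int] = None) -> Iterator[str]:
    """Yield windows of at most `window` bases; non-overlapping by default."""
    if window <= 0:
        raise ValueError("window must be positive")
    step = window if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    for start in range(0, len(seq), step):
        # The final window may be shorter than `window`.
        yield seq[start:start + window]


# Hypothetical stage names; the sizes mirror the context lengths in the summary.
STAGE_WINDOWS = {"short_context": 8_000, "long_context": 1_000_000}

toy_genome = "ACGT" * 5_000  # 20,000 bases of placeholder sequence

short_chunks = list(chunk_sequence(toy_genome, STAGE_WINDOWS["short_context"]))
long_chunks = list(chunk_sequence(toy_genome, STAGE_WINDOWS["long_context"]))

print([len(c) for c in short_chunks])  # [8000, 8000, 4000]
print([len(c) for c in long_chunks])   # [20000]
```

With the short window, the 20,000-base toy sequence is split into three pieces. With the long window it stays whole, which is the point of the second stage: features that span more than a few thousand bases are only visible when the model sees the longer context.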
In practice
- Automated preliminary genome annotation.
- Evaluating single-base mutations for impact.
- Identifying novel genomic features.
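As a rough illustration of the single-base mutation use case, one common way to score a substitution with a likelihood-based sequence model is to compare the log-likelihood of the variant sequence with that of the reference. The sketch below uses a toy Markov model as a stand-in for a real genome model. `ToyMarkovModel` and `score_substitution` are hypothetical names, and nothing here calls Evo 2 or reflects its API.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ToyMarkovModel:
    """Order-k Markov model over DNA; a stand-in for a real genome language model."""

    def __init__(self, order: int = 2):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        # Count how often each base follows each length-`order` context.
        for seq in sequences:
            for i in range(self.order, len(seq)):
                context = seq[i - self.order:i]
                self.counts[context][seq[i]] += 1
        return self

    def log_likelihood(self, seq: str) -> float:
        total = 0.0
        for i in range(self.order, len(seq)):
            context = seq[i - self.order:i]
            seen = self.counts.get(context, {})
            # Add-one smoothing so unseen bases keep a nonzero probability.
            numerator = seen.get(seq[i], 0) + 1
            denominator = sum(seen.values()) + len(ALPHABET)
            total += math.log(numerator / denominator)
        return total


def score_substitution(model, reference: str, position: int, alt_base: str) -> float:
    """Return log P(variant) - log P(reference); negative means less likely."""
    if reference[position] == alt_base:
        raise ValueError("alt_base matches the reference base")
    variant = reference[:position] + alt_base + reference[position + 1:]
    return model.log_likelihood(variant) - model.log_likelihood(reference)


# Toy "training" data: a repeated motif, so departures from it look unlikely.
model = ToyMarkovModel(order=2).fit(["ATGGCC" * 50])
reference = "ATGGCC" * 4
delta = score_substitution(model, reference, position=8, alt_base="T")
print(f"delta log-likelihood: {delta:.3f}")
```

A more negative score means the model finds the variant less plausible than the reference. The calculation needs no labeled examples of harmful mutations, which is what makes this kind of scoring zero-shot.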
Topics
- Large Genome Models
- Evo 2
- Convolutional Neural Networks
- Genome Annotation
- Zero-shot Learning
Best for: AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.