Large genome model: Open source AI trained on trillions of bases

· Source: AI - Ars Technica · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computational Genomics · Depth: Intermediate, medium

Summary

Evo 2 is an open-source AI system trained on 8.8 trillion base pairs of DNA drawn from all three domains of life (bacteria, archaea, and eukaryotes), plus viruses that infect bacteria. Built on a convolutional neural network called StripedHyena 2, it can identify genomic features such as regulatory DNA, splice sites, and protein-coding regions, even in complex eukaryotic genomes. Training ran in two stages: first on 8,000-base chunks to learn individual features, then on million-base sequences to capture large-scale ones. Two versions were released: a 7-billion-parameter model trained on 2.4 trillion bases and a 40-billion-parameter model trained on the full dataset. Evo 2 makes zero-shot predictions, meaning it was never trained on the task directly: it recognizes mutations and their severity, and adapts to the different genetic codes used across species.

Key takeaway

For AI scientists and bioinformaticians working with genomic data, Evo 2 is an open-source tool for preliminary genome annotation and feature identification. Because it makes zero-shot predictions across diverse genomes, including eukaryotes, you can look for regulatory DNA, splice sites, and protein-coding regions, and assess mutations, without task-specific fine-tuning. Consider integrating Evo 2 into your analysis pipelines to speed up the interpretation of newly sequenced genomes and the assessment of mutation impact.
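
One common way to get a zero-shot mutation score from a sequence model is to compare the log-likelihood the model assigns to the mutated sequence with the one it assigns to the reference. The sketch below illustrates that idea only. It does not call Evo 2: `fit_kmer_model` is a hypothetical toy stand-in (a smoothed 3-mer model), and the summary does not confirm that Evo 2 scores mutations this way.

```python
import math
from collections import Counter


def fit_kmer_model(training_seq: str, k: int = 3):
    """Return a log-likelihood function from a toy k-mer model (a stand-in, not Evo 2)."""
    pair_counts = Counter()     # (context, next base) -> count
    context_counts = Counter()  # context -> count
    for i in range(k, len(training_seq)):
        context = training_seq[i - k:i]
        pair_counts[(context, training_seq[i])] += 1
        context_counts[context] += 1

    def log_likelihood(seq: str) -> float:
        total = 0.0
        for i in range(k, len(seq)):
            context = seq[i - k:i]
            # Add-one smoothing over the four bases keeps unseen k-mers finite.
            p = (pair_counts[(context, seq[i])] + 1) / (context_counts[context] + 4)
            total += math.log(p)
        return total

    return log_likelihood


def mutation_score(log_likelihood, reference: str, position: int, alt_base: str) -> float:
    """Log-likelihood of the mutant minus that of the reference (negative = less likely)."""
    mutant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(mutant) - log_likelihood(reference)


# Toy data: a repetitive "genome" and a single substitution at position 10.
log_likelihood = fit_kmer_model("ATGGCC" * 50)
reference = "ATGGCC" * 5
score = mutation_score(log_likelihood, reference, position=10, alt_base="T")
print(f"change in log-likelihood: {score:.2f}")
```

A strongly negative score means the model finds the mutated sequence much less plausible than the reference. In a real pipeline the toy model would be replaced by the scoring call of whichever genome model is in use.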

Key insights

Evo 2 is an open-source AI trained on trillions of DNA bases to identify complex genomic features across all life domains.

Method

Evo 2 is built on StripedHyena 2 and trained in two stages: first on 8,000-base chunks to learn to identify individual features, then on million-base sequences to capture large-scale genomic features. The full training set totals 8.8 trillion bases.
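
The two context lengths can be pictured by cutting one sequence into windows of each size. This is a minimal sketch: the helper `chunk_sequence` and the non-overlapping windowing are assumptions made for illustration, and the summary does not describe how the actual training data was segmented.

```python
from typing import Iterator

SHORT_CONTEXT = 8_000      # stage-one chunk length given in the summary
LONG_CONTEXT = 1_000_000   # stage-two sequence length given in the summary


def chunk_sequence(seq: str, size: int) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows of at most `size` bases."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


# Toy 20,000-base sequence; real genomes are far longer.
genome = "ACGT" * 5_000
stage_one = list(chunk_sequence(genome, SHORT_CONTEXT))
stage_two = list(chunk_sequence(genome, LONG_CONTEXT))

print([len(chunk) for chunk in stage_one])  # [8000, 8000, 4000]
print(len(stage_two))                       # 1
```

With short windows the model sees many small examples; with the long window the whole toy sequence fits in a single example, which is what lets a model relate features that lie far apart.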

Best for: AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI - Ars Technica.