Orthrus: toward evolutionary and functional RNA foundation models

· Source: Machine learning : nature.com subject feeds · Field: Science & Research — Life Sciences & Biology, Mathematics & Computational Sciences, Research Methodology & Innovation · Depth: Expert, long

Summary

Orthrus is a Mamba-based foundation model for mature RNA, designed to predict key RNA properties and functions by building in biological domain knowledge. Unlike existing models that adapt pretraining strategies from the text domain, Orthrus uses a self-supervised contrastive learning objective with biological augmentations. It maximizes embedding similarity between splice isoforms from ten model organisms and orthologous genes across more than 400 mammalian species. This training approach yields latent representations that cluster RNA sequences by functional and evolutionary similarity. Orthrus's mature RNA isoform representations outperform other genomic foundation models on mRNA property prediction tasks while requiring significantly less fine-tuning data. The model also captures the divergent biological functions of individual transcript isoforms. Its code and pretrained models are publicly available on GitHub, Zenodo, and Hugging Face.

Key takeaway

For AI Scientists and Machine Learning Engineers developing genomic foundation models, Orthrus demonstrates that incorporating biological domain knowledge through contrastive learning significantly improves performance on RNA property prediction. Consider adopting similar biologically informed pretraining strategies to improve model accuracy and reduce fine-tuning data requirements, especially when working with complex biological sequences such as RNA.

Key insights

Orthrus is a Mamba-based RNA foundation model using contrastive learning and biological augmentations for superior RNA property prediction.

Principles

Method

Orthrus is pretrained with a self-supervised contrastive learning objective. Its biological augmentations are splice isoforms from ten model organisms and orthologous genes from more than 400 mammalian species; training maximizes the similarity of the embeddings of these related sequences.
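To make the idea of "maximizing embedding similarity" concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss in plain Python. It is an illustration only, not the loss or code used by the Orthrus authors: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for this example. Each anchor embedding is paired with one positive (standing in for a splice isoform or ortholog of the same gene), and the other positives in the batch act as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of (anchor, positive) embedding pairs.

    anchors[i] and positives[i] are assumed to be embeddings of two related
    sequences (hypothetically, two isoforms or orthologs of the same gene).
    positives[j] for j != i are treated as negatives for anchors[i].
    """
    if not anchors or len(anchors) != len(positives):
        raise ValueError("anchors and positives must be non-empty and equal in length")

    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]

    total = 0.0
    for i, a_i in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the true positive (index i) as the target class.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: two "genes", each with an anchor and a related positive.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
toy_positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(toy_anchors, toy_positives)
# Swapping the positives pairs each anchor with the wrong gene.
swapped_loss = info_nce_loss(toy_anchors, list(reversed(toy_positives)))
print(aligned_loss < swapped_loss)
```

The loss is lower when each anchor sits closest to its own positive, which is the pressure that pulls related transcripts together in the embedding space and pushes unrelated ones apart.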

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.