🔬ESM: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub
Summary
Biohub's Head of Science, Alex Rives, details the development and capabilities of their new protein language models, ESMC-6B and ESMFold2, which embody the "Bitter Lesson" of scaling laws in biology. ESMC-6B, a 6 billion parameter model, was trained on 6.8 billion non-redundant protein sequences, including extensive metagenomic data, to create a "world model" of protein biology. This approach has enabled the prediction of 1.1 billion protein structures and the successful design of protein binders and single-chain antibodies (scFvs) with therapeutic affinity. Through mechanistic interpretability using sparse autoencoders, the model reveals a hierarchical feature space mirroring established biological principles. Rives also outlines Biohub's Virtual Biology Initiative, a $500 million commitment ($400M internal, $100M external) to accelerate science through scaled data generation, predictive digital representations, and experimental feedback loops, aiming to bridge molecular and cellular understanding.
Key takeaway
For AI Scientists and Machine Learning Engineers developing biological models, Biohub's ESMC-6B demonstrates that scaling data, particularly noisy metagenomic sequences, is paramount for achieving emergent capabilities in protein understanding and design. You should prioritize integrating diverse, large-scale biological datasets to overcome data limitations and unlock advanced functionalities like therapeutic antibody design. Explore ESMC's open-source release to inform your model architectures and data strategies.
Key insights
Scaling protein language models with vast, diverse data reveals emergent biological capabilities and enables programmable design.
Principles
- Increasing data and parameters yields emergent biological capabilities.
- Metagenomic data is critical for general protein representation.
- Statistical patterns in amino acid contexts reflect underlying biology.
Method
ESMC uses masked language modeling on protein sequences, then sparse autoencoders for mechanistic interpretability, and inverts the model for protein design.
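To make the masked language modeling step concrete, here is a minimal sketch of how a protein sequence might be corrupted before training. It is not ESMC's actual pipeline: the `<mask>` symbol, the `mask_sequence` helper, the 15% mask rate (borrowed from common masked language modeling practice), and the example sequence are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical mask symbol, not taken from ESMC


def mask_sequence(sequence, mask_rate=0.15, seed=None):
    """Hide a random subset of residues for masked language modeling.

    Returns (tokens, targets). `tokens` is the sequence as a list with some
    residues replaced by MASK_TOKEN. `targets` maps each masked position to
    the residue a model would be trained to recover from the context.
    The 15% default rate is an assumption, not a figure from the source.
    """
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one position so every sequence yields a training signal.
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets


if __name__ == "__main__":
    # Made-up example sequence of 22 residues.
    tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", seed=0)
    print("".join("_" if t == MASK_TOKEN else t for t in tokens))
    print(targets)
```

A model trained on such pairs sees only `tokens` and is scored on how well it predicts the residues in `targets`, which is how statistical patterns in amino acid contexts end up encoded in its representations.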
In practice
- Design protein binders and single-chain antibodies (scFvs).
- Identify novel gene editing systems and functional motifs.
Topics
- Protein Language Models
- ESMC-6B
- Programmable Biology
- Antibody Design
- Metagenomics
- Mechanistic Interpretability
- Biohub Virtual Biology Initiative
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space - Www.latent.space.