🔬ESM: The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

Source: Latent.Space (www.latent.space) · Field: Science & Research (Life Sciences & Biology, Health & Medical Research, Research Methodology & Innovation) · Depth: Advanced, extended

Summary

Biohub's Head of Science, Alex Rives, details the development and capabilities of Biohub's new protein language models, ESMC-6B and ESMFold2, which apply the "Bitter Lesson" of scaling to biology. ESMC-6B, a 6-billion-parameter model, was trained on 6.8 billion non-redundant protein sequences, including extensive metagenomic data, to create a "world model" of protein biology. This approach has enabled the prediction of 1.1 billion protein structures and the successful design of protein binders and single-chain antibody fragments (scFvs) with therapeutic-level affinity. Mechanistic interpretability with sparse autoencoders shows that the model learns a hierarchical feature space mirroring established biological principles. Rives also outlines Biohub's Virtual Biology Initiative, a $500 million commitment ($400M internal, $100M external) to accelerate science through scaled data generation, predictive digital representations, and experimental feedback loops, with the aim of bridging molecular and cellular understanding.

Key takeaway

For AI Scientists and Machine Learning Engineers developing biological models, Biohub's ESMC-6B demonstrates that scaling data, particularly noisy metagenomic sequences, is paramount for achieving emergent capabilities in protein understanding and design. You should prioritize integrating diverse, large-scale biological datasets to overcome data limitations and unlock advanced functionalities like therapeutic antibody design. Explore ESMC's open-source release to inform your model architectures and data strategies.

Key insights

Scaling protein language models with vast, diverse data reveals emergent biological capabilities and enables programmable design.

Method

ESMC uses masked language modeling on protein sequences, then sparse autoencoders for mechanistic interpretability, and inverts the model for protein design.
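To make the first step of that method concrete, here is a minimal sketch of a masked language modeling objective on a protein sequence. It is not ESMC's training code: the 15% mask rate, the `<mask>` token, the example sequence, and the helper names (`mask_sequence`, `uniform_predictor`, `masked_lm_loss`) are all assumptions made for illustration, and the "model" is a placeholder that assigns equal probability to every residue.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"                       # hypothetical mask token


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of residues; return (tokens, targets).

    targets maps each masked position to the residue that was hidden.
    The 15% rate is an assumed convention, not a value from the source.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


def uniform_predictor(tokens, pos):
    """Placeholder model: equal probability for every residue."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}


def masked_lm_loss(tokens, targets, predict):
    """Mean negative log-likelihood of the true residue at masked positions."""
    losses = [-math.log(predict(tokens, pos)[aa]) for pos, aa in targets.items()]
    return sum(losses) / len(losses)


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
tokens, targets = mask_sequence(seq, rng=random.Random(0))
loss = masked_lm_loss(tokens, targets, uniform_predictor)
print("masked positions:", sorted(targets))
print("loss:", loss)
```

A trained model would replace `uniform_predictor` and push the loss below the uniform baseline of ln(20) by using the unmasked context. The sparse autoencoder and design steps mentioned above are not covered by this sketch.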

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space - Www.latent.space.