Viral Proteins Reveal Geometry of Protein Language Models
Summary
Protein language models (pLMs) are trained on highly imbalanced datasets, yet they still represent underrepresented biological sequences in a structured way. A study of viral proteins across ESM model families identified a dominant "nativeness axis" in the embedding space. This axis aligns with masked reconstruction perplexity and orders sequences from well-modeled cellular proteins, through viral proteins, to shuffled and random sequences. Scaling the models contracts this axis unevenly across viral families, but pLM embeddings still retain viral-specific signals: viral proteins remain linearly separable in ways that zero-shot perplexity and shallow sequence features do not account for. These findings indicate that pLM representations are structured by a general concept of nativeness while preserving information specific to distinct biological groups.
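To make the idea of a dominant axis that tracks perplexity concrete, here is a minimal sketch on synthetic data. It is not the study's pipeline: the embeddings, the latent nativeness values in `GROUPS`, the `TRUE_AXIS` direction, and the perplexity-like score are all assumptions made up for illustration. It finds the first principal direction by power iteration, projects each embedding onto it, and checks how strongly the projection correlates with the perplexity-like score.

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding width (real pLM embeddings are far wider)

# Hypothetical direction along which "nativeness" varies in the toy data.
raw = [float(j + 1) for j in range(DIM)]
raw_norm = math.sqrt(sum(x * x for x in raw))
TRUE_AXIS = [x / raw_norm for x in raw]

# Assumed latent nativeness per group; these values are invented.
GROUPS = {"cellular": 2.0, "viral": 1.0, "shuffled": 0.0}


def simulate(n_per_group=50):
    """Return synthetic embeddings, perplexity-like scores, and group labels."""
    embeddings, scores, labels = [], [], []
    for name, t_mean in GROUPS.items():
        for _ in range(n_per_group):
            t = random.gauss(t_mean, 0.3)
            emb = [3.0 * t * TRUE_AXIS[j] + random.gauss(0.0, 0.3) for j in range(DIM)]
            embeddings.append(emb)
            # Lower nativeness gives a higher perplexity-like score.
            scores.append(20.0 - 6.0 * t + random.gauss(0.0, 0.5))
            labels.append(name)
    return embeddings, scores, labels


def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def top_axis(rows, iters=100):
    """First principal direction of centered rows via power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


embeddings, scores, labels = simulate()
centered = center(embeddings)
axis = top_axis(centered)
projections = [sum(r[j] * axis[j] for j in range(DIM)) for r in centered]

corr = pearson(projections, scores)
group_means = {
    g: sum(p for p, lab in zip(projections, labels) if lab == g) / labels.count(g)
    for g in GROUPS
}
print("correlation with perplexity-like score:", round(corr, 3))
print("mean projection per group:", {g: round(m, 2) for g, m in group_means.items()})
```

The sign of a principal direction is arbitrary, so only the magnitude of the correlation and the ordering of the group means are meaningful. With real data, the embeddings would come from a pLM and the score from its masked reconstruction perplexity.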
Key takeaway
Research scientists interpreting protein language model embeddings should recognize that these models order sequences along a "nativeness axis" while also preserving information specific to distinct biological groups. Account for this dual organization when analyzing embedding spaces or designing downstream tasks, since it implies that pLM representations are both general and group-specific. This matters most when interpreting model behavior on underrepresented biological data such as viral proteins.
Key insights
Protein language models organize sequence embeddings by a "nativeness axis" while preserving specific biological group signals.
Principles
- pLM representations are structured by a general notion of nativeness.
- Viral proteins remain linearly separable beyond shallow sequence features.
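One way to read "linearly separable beyond" a scalar feature is sketched below on synthetic data. The setup is an assumption for illustration, not the study's protocol: toy embeddings carry a nativeness signal in dimension 0 and a group-specific signal in dimension 1, a perplexity-like score is regressed out of every dimension, and a simple mean-difference linear probe is then compared against a threshold on the score alone. All names (`make_group`, `fit_residualizer`, `fit_mean_diff`) are hypothetical helpers.

```python
import random

random.seed(0)
DIM = 6  # toy embedding width


def make_group(n, t_mean, g_mean):
    """Synthetic embeddings: dim 0 carries nativeness t, dim 1 a group signal g."""
    rows, scores = [], []
    for _ in range(n):
        t = random.gauss(t_mean, 1.0)
        g = random.gauss(g_mean, 0.3)
        emb = [random.gauss(0.0, 0.1) for _ in range(DIM)]
        emb[0] += t
        emb[1] += g
        rows.append(emb)
        scores.append(-t + random.gauss(0.0, 0.1))  # perplexity-like scalar
    return rows, scores


def fit_residualizer(X, s):
    """Per-dimension least-squares fit x_j ~ a_j + b_j * s; returns [(a_j, b_j)]."""
    n = len(X)
    s_mean = sum(s) / n
    s_var = sum((v - s_mean) ** 2 for v in s)
    coefs = []
    for j in range(len(X[0])):
        x_mean = sum(r[j] for r in X) / n
        b = sum((s[i] - s_mean) * (X[i][j] - x_mean) for i in range(n)) / s_var
        coefs.append((x_mean - b * s_mean, b))
    return coefs


def residualize(X, s, coefs):
    """Remove the part of each dimension that the scalar score predicts."""
    return [[r[j] - (a + b * s[i]) for j, (a, b) in enumerate(coefs)]
            for i, r in enumerate(X)]


def fit_mean_diff(X, y):
    """Linear probe: w = mean(class 1) - mean(class 0), threshold at the midpoint."""
    dim = len(X[0])
    mu = []
    for label in (0, 1):
        rows = [r for r, lab in zip(X, y) if lab == label]
        mu.append([sum(r[j] for r in rows) / len(rows) for j in range(dim)])
    w = [mu[1][j] - mu[0][j] for j in range(dim)]
    bias = sum(w[j] * (mu[0][j] + mu[1][j]) / 2 for j in range(dim))
    return w, bias


def accuracy(X, y, w, bias):
    """Fraction of rows whose side of the hyperplane matches the label."""
    hits = 0
    for r, lab in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, r)) > bias else 0
        hits += pred == lab
    return hits / len(y)


def pick(seq, idx):
    return [seq[i] for i in idx]


cell_x, cell_s = make_group(200, 1.5, -1.0)   # label 0, "cellular"
viral_x, viral_s = make_group(200, 1.0, 1.0)  # label 1, "viral"
X, s, y = cell_x + viral_x, cell_s + viral_s, [0] * 200 + [1] * 200
train, test = list(range(0, 400, 2)), list(range(1, 400, 2))

# Baseline: separate the groups using the scalar score alone.
w, b = fit_mean_diff([[v] for v in pick(s, train)], pick(y, train))
baseline_acc = accuracy([[v] for v in pick(s, test)], pick(y, test), w, b)

# Probe: regress the score out of the embeddings, then fit a linear probe.
coefs = fit_residualizer(pick(X, train), pick(s, train))
r_train = residualize(pick(X, train), pick(s, train), coefs)
r_test = residualize(pick(X, test), pick(s, test), coefs)
w, b = fit_mean_diff(r_train, pick(y, train))
probe_acc = accuracy(r_test, pick(y, test), w, b)

print("score-only accuracy:", baseline_acc)
print("probe accuracy after removing the score:", probe_acc)
```

In this toy setup the probe stays accurate after the score is removed because the group signal sits in a direction the score does not explain. A real analysis would use a regularized classifier with cross-validation, and would control for shallow sequence features in the same way.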
Topics
- Protein Language Models
- Viral Proteins
- ESM Models
- Protein Embeddings
- Sequence Representation
- Biological Sequences
- Nativeness Axis
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.