Viral Proteins Reveal Geometry of Protein Language Models

2026-06-10 · Source: Machine Learning · Field: Science & Research — Artificial Intelligence & Machine Learning, Life Sciences & Biology · Depth: Expert, quick

Summary

Protein language models (pLMs), despite being trained on highly imbalanced datasets, represent underrepresented biological sequences in a structured manner. A study of viral proteins across ESM model families identified a dominant "nativeness axis" in the embedding space. This axis, aligned with masked reconstruction perplexity, orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling the models contracts this axis, though unevenly across viral families, and pLM embeddings still retain viral-specific signals: viral proteins remain linearly separable beyond what zero-shot perplexity and shallow sequence features explain. These findings indicate that pLM representations are structured by a general concept of nativeness while preserving information specific to distinct biological groups.
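
The summary does not say how the study computes masked reconstruction perplexity. As a minimal sketch, assuming the common pseudo-perplexity formulation (mask each position in turn, record the log-probability the model assigns to the true residue, then exponentiate the negative mean), the calculation looks like the following. The function name and the probability values are hypothetical, and no real model is called.

```python
import math

def pseudo_perplexity(log_probs):
    """Exponentiated negative mean log-probability of the true residues.

    log_probs: natural-log probabilities, one per masked position, that a
    masked language model assigned to the residue actually present there.
    """
    if not log_probs:
        raise ValueError("need at least one position")
    return math.exp(-sum(log_probs) / len(log_probs))

# Made-up per-residue probabilities, for illustration only.
well_modeled = [math.log(p) for p in (0.9, 0.8, 0.85, 0.95)]
# A model guessing uniformly over the 20 standard amino acids.
uniform_guess = [math.log(1 / 20)] * 4

ppl_well = pseudo_perplexity(well_modeled)
ppl_uniform = pseudo_perplexity(uniform_guess)
```

Under this definition a sequence the model reconstructs confidently scores close to 1, while uniform guessing over 20 amino acids scores 20, which gives a rough scale for reading positions along the axis.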

Key takeaway

Research scientists interpreting protein language model embeddings should recognize that these models structure sequences along a "nativeness axis" while also preserving information about distinct biological groups. Account for this dual organization when analyzing embedding spaces or designing downstream tasks, since it suggests that pLM representations are both general and group-specific. This matters most when working with underrepresented biological data such as viral proteins.

Key insights

Protein language models organize sequence embeddings by a "nativeness axis" while preserving specific biological group signals.

Principles

pLM representations are structured by a general notion of nativeness.
Viral proteins remain linearly separable beyond shallow sequence features.
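
The two principles above can be illustrated with a toy sketch. It assumes, purely for illustration, that the axis is estimated as the difference between the mean embedding of cellular proteins and that of shuffled sequences. The summary does not describe the study's actual procedure, and the three-dimensional vectors below are made up. The sketch checks two things: projections onto the axis order the groups, and a coordinate left over after removing the axis component still separates the viral group.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_axis(start_group, end_group):
    """Unit vector pointing from the mean of start_group to the mean of end_group."""
    start, end = mean_vector(start_group), mean_vector(end_group)
    diff = [e - s for s, e in zip(start, end)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(vector, axis):
    """Scalar position of vector along a unit axis."""
    return sum(v * a for v, a in zip(vector, axis))

def residual(vector, axis):
    """What remains of vector after removing its component along the axis."""
    p = project(vector, axis)
    return [v - p * a for v, a in zip(vector, axis)]

# Hypothetical 3-D "embeddings"; real pLM embeddings have far more dimensions.
cellular = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]
viral = [[1.0, 0.0, 1.0], [1.2, 0.1, 0.9]]
shuffled = [[3.0, 0.0, 0.0], [3.2, 0.1, 0.0]]

axis = unit_axis(cellular, shuffled)
scores = {
    name: [project(v, axis) for v in group]
    for name, group in (("cellular", cellular), ("viral", viral), ("shuffled", shuffled))
}
# Third coordinate of each residual: a direction the axis does not account for.
viral_residual = [residual(v, axis)[2] for v in viral]
other_residual = [residual(v, axis)[2] for v in cellular + shuffled]
```

In this toy data a single threshold on the residual coordinate isolates the viral vectors. That is one simple reading of "linearly separable beyond" the axis, not a reproduction of the study's probes.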

Topics

Protein Language Models
Viral Proteins
ESM Models
Protein Embeddings
Sequence Representation
Biological Sequences
Nativeness Axis

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.