HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens
Summary
HD-Prot is a hybrid diffusion protein language model (pLM) for joint sequence-structure modeling. It addresses a limitation of existing methods, which discretize protein structures and lose fine-grained information in the process. Developed by researchers from The Hong Kong Polytechnic University, BioGen AI, and Mohamed bin Zayed University of Artificial Intelligence, HD-Prot attaches a continuous-valued diffusion head to a discrete protein language model. The model can therefore operate on both discrete sequence tokens and continuous structure tokens, which are high-fidelity protein structure latents produced without vector quantization. A unified absorbing diffusion process captures inter-token dependencies across the two modalities. HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding, performing comparably to leading multimodal pLMs while training on 1-2 GPUs at less than one-tenth the cost of DPLM-2.
Key takeaway
AI scientists and machine learning engineers developing multimodal protein generative models should consider continuous structure token approaches such as HD-Prot. Continuous tokens preserve fine-grained structural details that discrete quantization discards, which may improve accuracy in tasks such as protein structure prediction and inverse folding. HD-Prot also trains on 1-2 GPUs at less than one-tenth the training cost of DPLM-2, making it a practical option for resource-constrained teams that need competitive performance without extensive hardware investment.
Key insights
A hybrid diffusion pLM effectively integrates continuous protein structure tokens to overcome discretization's information loss.
Principles
- Protein sequence and structure are correlated but belong to different data modalities: sequence is discrete, structure is continuous.
- Discretizing continuous protein structures for pLMs inherently sacrifices fine-grained structural details.
- Continuous tokens offer superior fidelity for representing complex, fine-grained structural knowledge in multimodal models.
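The information loss described in these principles can be illustrated with a toy vector-quantization step. This is a minimal sketch under stated assumptions, not HD-Prot's tokenizer: the latent dimension, the codebook size, and the random data are hypothetical, chosen only to show that snapping a continuous latent to its nearest codebook entry leaves a residual error that a model fed only token indices cannot recover.

```python
import numpy as np

def quantize(latents, codebook):
    """Snap each latent vector to its nearest codebook entry."""
    # Squared Euclidean distance between every latent and every code: (N, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 8))   # hypothetical per-residue structure latents
codebook = rng.normal(size=(64, 8))   # hypothetical codebook of 64 discrete tokens

indices, quantized = quantize(latents, codebook)

# Discrete route: the model only sees `indices`, so this residual is lost.
vq_error = float(np.mean((latents - quantized) ** 2))
print(f"mean squared quantization error: {vq_error:.3f}")
```

A continuous-token model consumes `latents` directly, so this particular source of error does not arise.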
Method
HD-Prot encodes structures into continuous tokens with a non-quantized autoencoder (the salad tokenizer). It then models inter-token dependencies with unified absorbing diffusion language modeling, using categorical prediction for sequence tokens and a continuous diffusion head for structure tokens.
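The split between categorical prediction and continuous diffusion can be sketched as a corruption step plus two masked losses. Everything specific here is an assumption made for illustration and is not taken from the HD-Prot code: the `MASK_ID` value, the zero vector standing in for a masked structure token, independent masking of the two modalities, and the noise-prediction form of the structure loss.

```python
import numpy as np

MASK_ID = 20  # hypothetical absorbing-state id (amino acids use ids 0..19)

def absorb(seq_tokens, struct_latents, t, rng):
    """Replace each position with the absorbing state with probability t."""
    length = seq_tokens.shape[0]
    seq_mask = rng.random(length) < t
    struct_mask = rng.random(length) < t
    noisy_seq = np.where(seq_mask, MASK_ID, seq_tokens)
    # Assumption: a masked continuous token is represented by a zero vector.
    noisy_struct = np.where(struct_mask[:, None], 0.0, struct_latents)
    return noisy_seq, noisy_struct, seq_mask, struct_mask

def sequence_loss(logits, targets, mask):
    """Cross-entropy averaged over masked sequence positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((nll * mask).sum() / max(mask.sum(), 1))

def structure_loss(pred_noise, true_noise, mask):
    """Mean squared noise-prediction error over masked structure positions only."""
    per_token = ((pred_noise - true_noise) ** 2).mean(axis=-1)
    return float((per_token * mask).sum() / max(mask.sum(), 1))

rng = np.random.default_rng(0)
seq = rng.integers(0, 20, size=12)    # toy sequence of 12 residues
struct = rng.normal(size=(12, 8))     # toy continuous structure tokens
noisy_seq, noisy_struct, seq_mask, struct_mask = absorb(seq, struct, t=0.5, rng=rng)
```

In a real model, the logits and predicted noise would come from the shared backbone and the diffusion head respectively; here they are left as inputs so the loss functions can be checked in isolation.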
In practice
- HD-Prot trains on 1-2 GPUs, at less than one-tenth the training cost of DPLM-2.
- Classifier-free guidance enhances consistency when generating continuous structure tokens.
- Strategic fine-tuning (e.g., LoRA for larger models) prevents the model from forgetting sequence knowledge when the structure modality is added.
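Two of the techniques named above have standard textbook forms, sketched here in NumPy. This is an illustration of the general formulas, not HD-Prot's implementation: the guidance scale, the shapes, and the function names `cfg_combine` and `lora_forward` are hypothetical. Classifier-free guidance extrapolates from the unconditional prediction toward the conditional one, and LoRA adds a trainable low-rank update to a frozen weight matrix.

```python
import numpy as np

def cfg_combine(cond_pred, uncond_pred, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional prediction."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

def lora_forward(x, W, A, B, alpha):
    """Linear layer with a frozen weight W and a low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)

# Guidance on a toy diffusion-head output for 4 structure tokens of dimension 8.
cond = rng.normal(size=(4, 8))
uncond = rng.normal(size=(4, 8))
guided = cfg_combine(cond, uncond, guidance_scale=2.0)

# LoRA on a toy layer: W stays frozen, only A and B would be trained.
x = rng.normal(size=(4, 16))
W = rng.normal(size=(32, 16))
A = rng.normal(size=(4, 16))      # rank r = 4
B = np.zeros((32, 4))             # zero init: the adapted layer starts identical to the frozen one
y = lora_forward(x, W, A, B, alpha=8.0)
```

With `B` initialized to zero, the adapted layer reproduces the frozen layer's output exactly at the start of fine-tuning, which is one reason low-rank adapters are a common way to limit forgetting of pretrained knowledge.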
Topics
- Protein Language Models
- Continuous Tokens
- Diffusion Models
- Protein Structure Prediction
- Inverse Folding
- Motif Scaffolding
Code references
- EchoChou990919/hdprot
- jasonkyuyim/multiflow
- NVIDIA-Digital-Bio/la-proteina
- bytedance/dplm
- evolutionaryscale/esm
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.