HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Artificial Intelligence & Machine Learning, Life Sciences & Biology, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

HD-Prot is a hybrid diffusion protein language model (pLM) for joint sequence-structure modeling. It targets a limitation of existing methods, which discretize protein structures and lose fine-grained information in the process. Developed by researchers from The Hong Kong Polytechnic University, BioGen AI, and Mohamed bin Zayed University of Artificial Intelligence, HD-Prot attaches a continuous-valued diffusion head to a discrete protein language model. The model can therefore operate on both discrete sequence tokens and continuous structure tokens, the latter being high-fidelity protein structure latents that avoid vector quantization. A unified absorbing diffusion process captures inter-token dependencies across the two modalities. HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding, performing comparably to leading multimodal pLMs while training on 1-2 GPUs at less than one-tenth the cost of DPLM-2.

Key takeaway

AI scientists and machine learning engineers developing multimodal protein generative models should consider continuous structure token approaches such as HD-Prot. Continuous tokens preserve fine-grained structural detail that discrete quantization discards, which may improve accuracy in tasks such as protein structure prediction and inverse folding. HD-Prot's reported efficiency (training on 1-2 GPUs at less than one-tenth the cost of DPLM-2) also makes it a practical option for resource-constrained teams that need competitive performance without a large hardware investment.

Key insights

A hybrid diffusion pLM effectively integrates continuous protein structure tokens to overcome discretization's information loss.

Principles

Protein sequence and structure are correlated but belong to different data modalities: sequence is discrete, structure is continuous.
Discretizing continuous protein structures for pLMs inherently sacrifices fine-grained structural details.
Continuous tokens offer superior fidelity for representing complex, fine-grained structural knowledge in multimodal models.
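
The information loss behind these principles can be illustrated with a toy example. The sketch below is not HD-Prot's tokenizer: the latents, the codebooks, and the helper names `nearest_code` and `quantization_error` are all hypothetical. It only shows that snapping a continuous latent to its nearest codebook entry leaves a nonzero residual that shrinks as the codebook grows, while keeping the latent itself leaves none.

```python
import math
import random


def nearest_code(vec, codebook):
    """Return the codebook entry closest to vec in Euclidean distance."""
    return min(codebook, key=lambda code: math.dist(vec, code))


def quantization_error(latents, codebook):
    """Mean distance between each latent and its nearest codebook entry."""
    total = sum(math.dist(z, nearest_code(z, codebook)) for z in latents)
    return total / len(latents)


rng = random.Random(0)
dim = 4

# Hypothetical continuous structure latents, one vector per residue.
latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]

# Hypothetical codebooks; the large one is a superset of the small one.
small_codebook = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(8)]
extra_codes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(504)]
large_codebook = small_codebook + extra_codes

err_small = quantization_error(latents, small_codebook)
err_large = quantization_error(latents, large_codebook)
# A continuous token is the latent itself, so nothing is lost.
err_continuous = quantization_error(latents, latents)

print(f"8 codes:    {err_small:.3f}")
print(f"512 codes:  {err_large:.3f}")
print(f"continuous: {err_continuous:.3f}")
```

A larger codebook narrows the gap but never closes it, which is the trade-off that continuous tokens avoid.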

Method

HD-Prot encodes structures into continuous tokens with a non-quantized autoencoder (the salad tokenizer). It then models inter-token dependencies with unified absorbing diffusion language modeling, predicting sequence tokens with a categorical head and structure tokens with a continuous diffusion head.
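
A minimal sketch of what an absorbing corruption step over two modalities might look like follows. It rests on assumptions that the summary does not confirm: each position in each track is masked independently with the same probability, and a single `MASK` symbol serves both tracks. The function name `absorb` and the toy data are hypothetical. The point is only the routing: masked sequence positions yield categorical targets, and masked structure positions yield continuous targets for a diffusion head.

```python
import random

MASK = "<mask>"  # hypothetical absorbing-state symbol shared by both tracks


def absorb(seq_tokens, struct_tokens, mask_ratio, rng):
    """Mask each token in each track independently with probability mask_ratio.

    Returns the corrupted tracks plus two dicts mapping masked positions to
    their original values: amino-acid labels for the categorical head and
    latent vectors for the continuous diffusion head.
    """
    noisy_seq, noisy_struct = [], []
    seq_targets, struct_targets = {}, {}
    for i, (aa, latent) in enumerate(zip(seq_tokens, struct_tokens)):
        if rng.random() < mask_ratio:
            noisy_seq.append(MASK)
            seq_targets[i] = aa  # discrete target
        else:
            noisy_seq.append(aa)
        if rng.random() < mask_ratio:
            noisy_struct.append(MASK)
            struct_targets[i] = latent  # continuous target
        else:
            noisy_struct.append(latent)
    return noisy_seq, noisy_struct, seq_targets, struct_targets


rng = random.Random(0)
sequence = list("MKTAYIAK")  # toy amino-acid sequence
# Toy 3-dimensional structure latents, one per residue (illustrative only).
structure = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in sequence]

noisy_seq, noisy_struct, seq_targets, struct_targets = absorb(
    sequence, structure, mask_ratio=0.5, rng=rng
)
print(noisy_seq)
print(sorted(seq_targets), sorted(struct_targets))
```

Because a position can be masked in one track and visible in the other, a model trained on such inputs sees sequence-conditioned-on-structure and structure-conditioned-on-sequence cases, as well as joint generation.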

In practice

Training HD-Prot requires only 1-2 GPUs, at less than one-tenth the training cost of DPLM-2.
Classifier-free guidance enhances consistency when generating continuous structure tokens.
Strategic fine-tuning (e.g., LoRA for larger models) keeps the model from forgetting sequence knowledge when the structure modality is added.
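
For readers unfamiliar with classifier-free guidance, the generic blending rule is sketched below. This is the common formulation, not a confirmed detail of HD-Prot: the function name, the scale convention (a scale of 1 returns the conditional prediction unchanged), and the toy vectors are assumptions for illustration.

```python
def classifier_free_guidance(cond, uncond, scale):
    """Blend conditional and unconditional predictions element-wise.

    Uses the convention guided = uncond + scale * (cond - uncond), so
    scale = 0 gives the unconditional prediction, scale = 1 gives the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy denoiser outputs for a single structure token (illustrative values).
conditional = [1.0, -2.0]
unconditional = [0.5, 0.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, classifier_free_guidance(conditional, unconditional, scale))
```

Scales above 1 push each sample further toward what the conditioning implies, which is the usual explanation for improved consistency, typically at some cost in diversity.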

Topics

Protein Language Models
Continuous Tokens
Diffusion Models
Protein Structure Prediction
Inverse Folding
Motif Scaffolding

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.