A large-scale unified deep learning model for peptide mass spectrum interpretation trained on multimodal data

Source: Nature Machine Intelligence · Field: Science & Research — Life Sciences & Biology, Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

pUniFind is a large-scale multimodal foundation model for peptide mass spectrum interpretation in proteomics. Trained on over 100 million open-search-derived spectra, pUniFind unifies end-to-end peptide-spectrum scoring with zero-shot de novo sequencing by aligning spectral and peptide modalities through cross-modality prediction and other pretraining tasks. The model significantly outperforms traditional search engines, with a 42.6% increase in identified peptides in immunopeptidomics. For modification-rich de novo sequencing, pUniFind identifies 60% more peptide-spectrum matches despite a 300-fold larger search space. It also recovers an additional 38.5% of peptides in regular de novo sequencing, including 1,891 peptides absent from reference proteomes but mapping to the genome. A deep learning-derived quality control module further raises consistency with RNA-Seq evidence from 65.4% to 85.0%.
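
The summary mixes relative gains (for example, "42.6% increase") with a before-and-after pair of percentages (65.4% to 85.0%). As a small worked example, the arithmetic below expresses the RNA-Seq consistency change both ways. The two input figures come from the summary above; the variable names are illustrative only.

```python
# Figures quoted in the summary: RNA-Seq consistency before and after
# applying the quality control module.
before, after = 65.4, 85.0

# Absolute change, in percentage points.
absolute_gain = after - before

# Relative change, as a percentage of the starting value.
relative_gain = absolute_gain / before * 100

print(f"{absolute_gain:.1f} percentage points, {relative_gain:.1f}% relative")
```

The change is 19.6 percentage points, or roughly a 30% relative improvement over the starting value.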

Key takeaway

For proteomic researchers and bioinformaticians analyzing complex mass spectrometry data, pUniFind offers a powerful new tool to enhance peptide identification and de novo sequencing. You should consider integrating pUniFind into your workflows, especially for studies involving immunopeptidomics or novel post-translational modifications, to uncover previously missed peptides and improve data consistency with genomic evidence. This can accelerate discovery in areas like biomarker identification and drug target validation.

Key insights

pUniFind unifies peptide-spectrum scoring and de novo sequencing, significantly improving proteomic analysis sensitivity and modification coverage.

Method

pUniFind is trained on more than 100 million spectra using cross-modality prediction and other pretraining tasks, and it integrates end-to-end peptide-spectrum scoring with zero-shot de novo sequencing. It offers two de novo workflows.
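
To make "aligning spectral and peptide modalities" concrete, here is a minimal sketch of how scoring could work once a spectrum and its candidate peptides are embedded in a shared vector space: each candidate is ranked by cosine similarity to the spectrum. This is an assumption for illustration, not pUniFind's actual architecture or API. The functions `cosine` and `rank_candidates` are hypothetical, and the three-dimensional vectors are made-up placeholders, not model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(spectrum_vec, peptide_vecs):
    """Return (peptide, score) pairs sorted from best to worst match."""
    scored = [(pep, cosine(spectrum_vec, vec)) for pep, vec in peptide_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder embeddings, invented for illustration only. A real model
# would produce these from a spectrum encoder and a peptide encoder.
spectrum_vec = [0.9, 0.1, 0.3]
peptide_vecs = {
    "PEPTIDEK": [0.8, 0.2, 0.3],
    "ELVISLIVESK": [0.1, 0.9, 0.2],
}

ranking = rank_candidates(spectrum_vec, peptide_vecs)
best_peptide, best_score = ranking[0]
print(best_peptide, round(best_score, 3))
```

In a database-search setting the candidates would come from a protein digest, while a de novo workflow would have to generate the sequence itself rather than rank a fixed list.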

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Nature Machine Intelligence.