Evaluating the role of pretraining dataset size and diversity on single-cell foundation model performance
Summary
A study investigated how pretraining dataset size and diversity affect single-cell foundation model (scFM) performance, motivated by the success of transformer models in other domains. Researchers pretrained 400 models using a corpus of 22.2 million cells and conducted 6,400 experiments to evaluate their performance on zero-shot and fine-tuned tasks, including cell classification, batch integration, and perturbation response prediction. The findings indicate that, unlike large language models, scFMs do not exhibit clear data scaling laws. Instead, performance tends to plateau with pretraining datasets that are only a fraction of the total available data, suggesting that indiscriminately increasing dataset size does not yield proportional gains.
Key takeaway
If you are an AI Scientist or Research Scientist developing single-cell foundation models, re-evaluate strategies focused solely on massive data scaling. You will likely get more from optimizing the balance between model capacity, dataset size, and computational resources than from simply accumulating more data. Consider investing in diverse, high-quality datasets and efficient model architectures to achieve performance gains without excessive computational overhead.
Key insights
Single-cell foundation models do not show clear data scaling laws, unlike large language models.
Principles
- scFM performance plateaus at a fraction of available pretraining data.
- Indiscriminate data scaling does not yield proportional gains in scFM performance.
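One way to make the plateau idea concrete is to find the smallest pretraining fraction whose score is already within a small tolerance of the best score observed. The sketch below is an illustration only and is not taken from the study: the function name `plateau_fraction`, the tolerance value, and all numbers are hypothetical.

```python
def plateau_fraction(fractions, scores, tolerance=0.01):
    """Return the smallest data fraction whose score is within
    `tolerance` of the best score seen across all fractions.

    Hypothetical helper for illustration; not the study's method.
    """
    if not fractions or len(fractions) != len(scores):
        raise ValueError("fractions and scores must be non-empty and equal in length")
    best = max(scores)
    # Walk from the smallest fraction upward and stop at the first
    # point that is already close enough to the best score.
    for frac, score in sorted(zip(fractions, scores)):
        if best - score <= tolerance:
            return frac
    return max(fractions)  # only reached if tolerance is negative


# Made-up numbers for illustration, NOT results from the study.
fractions = [0.01, 0.1, 0.25, 0.5, 1.0]
scores = [0.62, 0.78, 0.80, 0.805, 0.803]
print(plateau_fraction(fractions, scores, tolerance=0.01))
```

With these made-up values, the score at a quarter of the data is already within the tolerance of the best score, which is the kind of pattern the plateau finding describes. The choice of tolerance is a judgment call and should reflect the run-to-run noise of the evaluation metric.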
Method
Researchers pretrained 400 single-cell foundation models on a 22.2 million cell corpus, then evaluated them across 6,400 experiments on zero-shot and fine-tuned tasks.
In practice
- Balance model capacity, dataset size, and computational resources.
- Prioritize data quality and diversity over sheer volume for scFMs.
Topics
- Single-cell Foundation Models
- Pretraining Data Scaling
- Transcriptomic Datasets
- Model Performance Plateau
- Computational Biology
- Zero-shot Learning
Code references
- ArcInstitute/arc-virtual-cell-atlas
- microsoft/scFM-dataselection
- theislab/ssl_in_scg
- Genentech/scimilarity
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine learning : nature.com subject feeds.