Pre-Training Isn’t Bitter Enough
Summary
Value-based pre-training with downstream feedback (V-pretraining), developed at CMU and published June 17, 2026, introduces a novel approach to continued pre-training by making the self-supervised task construction learnable. Unlike standard methods where the pre-training objective is fixed, V-pretraining employs a lightweight task designer that uses a small set of verifiable downstream examples to determine which self-supervised prediction problems are most useful. The learner, a foundation model, continues to update solely on unlabeled data via a self-supervised loss. The designer is trained to construct tasks whose learner gradients ($g_{\rm pre}$) align with downstream gradients ($g_{\rm down}$), estimated by their inner product $g_{\rm down}^{\top} g_{\rm pre}$. This method demonstrated significant performance improvements: Qwen2.5-0.5B's GSM8K Pass@1 increased from 22.20 to 29.60, and DINOv3-ViT-L's ADE20K mIoU improved from 51.33 to 52.47.
Key takeaway
For Machine Learning Engineers optimizing foundation model pre-training, consider integrating V-pretraining to dynamically adapt self-supervised tasks. By using a task designer to align pre-training gradients with small downstream feedback batches, you can achieve better performance on target tasks like GSM8K or ADE20K without direct supervision. This approach makes pre-training more "bitter" by learning what to predict, potentially reducing the need for extensive manual objective tuning.
Key insights
V-pretraining learns optimal self-supervised tasks using downstream feedback, making pre-training more adaptive and effective.
Principles
- Scalable methods that learn from data outperform hand-designed structures.
- Feedback should shape task construction, not directly supervise the learner.
- Aligning self-supervised and downstream gradients improves pre-training.
Method
V-pretraining trains a task designer to construct self-supervised tasks whose learner gradients ($g_{\rm pre}$) align with downstream gradients ($g_{\rm down}$), estimated by $g_{\rm down}^{\top} g_{\rm pre}$, without directly supervising the learner.
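To make the alignment score concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy linear learner with a squared-error loss, and every name in it (`grad`, `candidate_tasks`, `task_a`, `task_b`) is hypothetical. It computes $g_{\rm down}^{\top} g_{\rm pre}$ for two candidate self-supervised tasks and checks the standard first-order fact that one gradient step on the pre-training loss changes the downstream loss by roughly $-\eta\, g_{\rm down}^{\top} g_{\rm pre}$, where $\eta$ is the learning rate.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def loss(w, x, target):
    """Squared-error loss 0.5 * (w.x - target)^2 for a toy linear learner."""
    return 0.5 * (dot(w, x) - target) ** 2

def grad(w, x, target):
    """Gradient of the loss above with respect to the weights w."""
    residual = dot(w, x) - target
    return [residual * xi for xi in x]

def sgd_step(w, g, lr):
    """One plain gradient-descent step."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Toy learner weights and a single downstream feedback example (assumed values).
w = [0.0, 0.0]
x_down, y_down = [1.0, 2.0], 3.0
g_down = grad(w, x_down, y_down)

# Two hypothetical self-supervised tasks, each an (input, target) pair.
candidate_tasks = {
    "task_a": ([1.0, 1.0], 2.0),
    "task_b": ([2.0, -1.0], 1.0),
}

lr = 0.01
scores, loss_change = {}, {}
for name, (x_pre, t_pre) in candidate_tasks.items():
    g_pre = grad(w, x_pre, t_pre)
    # Alignment score: g_down^T g_pre.
    scores[name] = dot(g_down, g_pre)
    # The learner updates only on the self-supervised task.
    w_new = sgd_step(w, g_pre, lr)
    loss_change[name] = loss(w_new, x_down, y_down) - loss(w, x_down, y_down)

print(scores)
print(loss_change)
```

In this toy setup `task_a` has a positive alignment score and its update lowers the downstream loss, while `task_b` has a score of zero and leaves the downstream loss unchanged. A designer trained to raise this score would therefore favor tasks like `task_a`, even though the learner never sees the downstream label.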
In practice
- Use adaptive top-K soft target construction for language models.
- Apply self-supervised view construction for vision models.
- Integrate small downstream feedback batches during continued pre-training.
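This summary does not describe how the adaptive top-K soft targets are built, so the following is only one plausible reading, sketched under stated assumptions. It assumes the candidates are the K tokens with the highest learner logits and that a designer supplies per-token scores that are turned into weights over those candidates. The function names, the `designer_scores` input, and the fixed `k` are all hypothetical, and how K adapts is not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_soft_target(learner_logits, designer_scores, k):
    """Soft target distribution supported on the k highest-logit tokens.

    Assumption: candidates come from the learner's own logits and the
    designer scores (hypothetical input) decide how mass is split among them.
    """
    order = sorted(range(len(learner_logits)),
                   key=lambda i: learner_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([designer_scores[i] for i in top])
    target = [0.0] * len(learner_logits)
    for i, wgt in zip(top, weights):
        target[i] = wgt
    return target

def soft_cross_entropy(target, learner_logits):
    """Cross-entropy between a soft target and the learner's softmax."""
    probs = softmax(learner_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

# Toy vocabulary of four tokens (assumed values).
learner_logits = [2.0, 1.0, 0.5, -1.0]
designer_scores = [0.0, 1.0, 0.0, 3.0]

target = top_k_soft_target(learner_logits, designer_scores, k=2)
pre_loss = soft_cross_entropy(target, learner_logits)
print(target, pre_loss)
```

Here the target places all of its mass on tokens 0 and 1, with more weight on token 1 because the designer scores it higher. The learner's self-supervised loss is the cross-entropy against that target, so the designer influences what is predicted without supplying downstream labels.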
Topics
- V-pretraining
- Self-supervised Learning
- Foundation Models
- Task Design
- Gradient Alignment
- Model Pre-training
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning Blog | ML@CMU | Carnegie Mellon University.