Pre-Training Isn’t Bitter Enough

2026-06-17 · Source: Machine Learning Blog | ML@CMU | Carnegie Mellon University · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

Value-based pre-training with downstream feedback (V-pretraining), developed at CMU and published June 17, 2026, makes the construction of self-supervised tasks learnable during continued pre-training. In standard methods the pre-training objective is fixed. V-pretraining instead adds a lightweight task designer that uses a small set of verifiable downstream examples to decide which self-supervised prediction problems are most useful. The learner, a foundation model, still updates only on unlabeled data through a self-supervised loss. The designer is trained to construct tasks whose learner gradients ($g_{\rm pre}$) align with downstream gradients ($g_{\rm down}$), with alignment measured by the inner product $g_{\rm down}^{\top} g_{\rm pre}$. Reported results: Qwen2.5-0.5B's GSM8K Pass@1 rose from 22.20 to 29.60, and DINOv3-ViT-L's ADE20K mIoU rose from 51.33 to 52.47.

Key takeaway

For Machine Learning Engineers optimizing foundation model pre-training, consider V-pretraining as a way to adapt self-supervised tasks during continued pre-training. A task designer aligns pre-training gradients with gradients from small downstream feedback batches, so performance on target tasks like GSM8K or ADE20K can improve without supervising the learner directly on downstream labels. This approach makes pre-training more "bitter" by learning what to predict, potentially reducing the need for extensive manual objective tuning.

Key insights

V-pretraining learns optimal self-supervised tasks using downstream feedback, making pre-training more adaptive and effective.

Principles

Scalable methods that learn from data outperform hand-designed structures.
Feedback should shape task construction, not directly supervise the learner.
Aligning self-supervised and downstream gradients improves pre-training.

Method

V-pretraining trains a task designer to construct self-supervised tasks whose learner gradients ($g_{\rm pre}$) align with downstream gradients ($g_{\rm down}$), estimated by $g_{\rm down}^{\top} g_{\rm pre}$, without directly supervising the learner.
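Why the inner product is a sensible signal: to first order, one learner step of size $\eta$ along $-g_{\rm pre}$ changes the downstream loss by about $-\eta\, g_{\rm down}^{\top} g_{\rm pre}$, so a positive score predicts a downstream improvement. The sketch below illustrates this on a toy linear model with squared loss. It is not the authors' implementation: the model, the two fixed candidate tasks, and every name in it are assumptions made for illustration, and no designer is trained here.

```python
# Toy illustration of the gradient-alignment score g_down^T g_pre.
# Everything here (linear model, squared loss, candidate tasks) is a
# hypothetical stand-in, not the V-pretraining code.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, x, y):
    """Squared loss 0.5 * (w.x - y)^2 for a linear model."""
    err = dot(w, x) - y
    return 0.5 * err * err

def grad(w, x, y):
    """Gradient of the squared loss with respect to w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]

def sgd_step(w, g, lr):
    return [wi - lr * gi for wi, gi in zip(w, g)]

w = [0.0, 0.0]                      # learner parameters
down_x, down_y = [1.0, 0.0], 1.0    # one downstream feedback example
candidate_tasks = {
    "a": ([1.0, 0.5], 1.0),         # candidate self-supervised task A
    "b": ([-1.0, 0.5], 1.0),        # candidate self-supervised task B
}
lr = 0.1

g_down = grad(w, down_x, down_y)
base_loss = loss(w, down_x, down_y)

results = {}
for name, (x, y) in candidate_tasks.items():
    g_pre = grad(w, x, y)
    score = dot(g_down, g_pre)             # alignment score
    w_new = sgd_step(w, g_pre, lr)         # learner updates on the task only
    results[name] = (score, loss(w_new, down_x, down_y))

for name, (score, new_loss) in results.items():
    print(name, "alignment:", score, "downstream loss:", base_loss, "->", new_loss)
```

In this toy setup the task with the positive score lowers the downstream loss and the task with the negative score raises it, even though the learner never takes a step on the downstream example itself.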

In practice

Use adaptive top-K soft target construction for language models.
Apply self-supervised view construction for vision models.
Integrate small downstream feedback batches during continued pre-training.
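The summary does not spell out how the top-K soft targets are built, so the following is only one plausible reading, sketched under stated assumptions: candidates are the learner's K highest-scoring tokens, the observed next token is always kept, and a hypothetical designer supplies per-token scores that are normalized into target weights. The function names, the fixed `k`, and the `designer_scores` input are all illustrative rather than taken from the original article.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_soft_target(learner_logits, designer_scores, observed_token, k):
    """Build a soft target distribution supported on k candidate tokens.

    Assumption: candidates are the learner's top-k tokens, with the observed
    token swapped in for the lowest-ranked candidate if it is missing.
    Requires k >= 1.
    """
    vocab = len(learner_logits)
    ranked = sorted(range(vocab), key=lambda i: learner_logits[i], reverse=True)
    candidates = ranked[:k]
    if observed_token not in candidates:
        candidates[-1] = observed_token
    # Hypothetical designer output: one score per token, normalized over candidates.
    weights = softmax([designer_scores[i] for i in candidates])
    target = [0.0] * vocab
    for i, wgt in zip(candidates, weights):
        target[i] = wgt
    return target

def soft_cross_entropy(learner_logits, target):
    """Self-supervised loss of the learner against the soft target."""
    probs = softmax(learner_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

learner_logits = [2.0, 1.0, 0.5, -1.0, 0.0]
designer_scores = [0.0, 1.0, 0.0, 2.0, 0.0]
target = top_k_soft_target(learner_logits, designer_scores, observed_token=3, k=3)
ce = soft_cross_entropy(learner_logits, target)
print("soft target:", target)
print("learner loss:", ce)
```

In a full system the designer scores would come from a trained module updated with the alignment signal described under Method. Here they are fixed numbers so the target construction can be inspected on its own.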

Topics

V-pretraining
Self-supervised Learning
Foundation Models
Task Design
Gradient Alignment
Model Pre-training

Best for: Research Scientist, AI Scientist, Machine Learning Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning Blog | ML@CMU | Carnegie Mellon University.