Constraint-based Pre-training: From Structured Constraints to Scalable Model Initialization
Summary
Conventional pre-training typically produces a model at one fixed scale. The constraint-based pre-training paradigm proposed here addresses this limitation by imposing structured constraints during pre-training, which separate size-agnostic knowledge into reusable weight templates. Size-specific adaptation is handled by lightweight weight scalers, so initializing models of different sizes is reframed as a multi-task adaptation problem. Within this paradigm, the WeiT method uses Kronecker-based constraints to regularize pre-training. WeiT represents model parameters as compositions of weight templates, combined through concatenation and weighted aggregation, while lightweight weight scalers learned from limited data manage the adaptive connections. This design allows model weights to be constructed flexibly and efficiently for diverse downstream scales. In the reported experiments, WeiT achieves state-of-the-art performance when initializing models of varying depths and widths on image classification, image generation, and embodied control, works for both Transformer-based and convolution-based architectures, and yields faster convergence and improved performance.
Key takeaway
Research scientists developing large-scale models should consider constraint-based pre-training paradigms such as WeiT for more flexible and efficient model initialization across scales. By separating core knowledge from size-specific adaptations, the approach can improve convergence speed and performance, even with full training, and it streamlines the deployment of models with differing computational requirements.
Key insights
Constraint-based pre-training disentangles size-agnostic knowledge from size-specific adaptation for scalable model initialization.
Principles
- Disentangle size-agnostic and size-specific knowledge.
- Reformulate variable-sized initialization as multi-task adaptation.
Method
WeiT applies Kronecker-based constraints during pre-training so that model parameters can be represented as compositions of weight templates, with lightweight weight scalers providing the adaptive connections for each target size.
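To make the template-and-scaler idea concrete, here is a minimal NumPy sketch. It assumes one plausible reading of the description: a weight matrix is formed as a sum of Kronecker products, so each block of the result is a weighted aggregation of the templates and the blocks are concatenated into the full matrix. The function name `compose_weight`, the array shapes, and the number of templates are illustrative assumptions, not details taken from the original article.

```python
import numpy as np

def compose_weight(templates, scalers):
    """Build a weight matrix as sum_k kron(scalers[k], templates[k]).

    templates: array (K, t_out, t_in), assumed shared across model sizes.
    scalers:   array (K, g_out, g_in), assumed specific to one target size.
    Returns an array of shape (g_out * t_out, g_in * t_in).
    """
    return sum(np.kron(s, t) for s, t in zip(scalers, templates))

rng = np.random.default_rng(0)
templates = rng.normal(size=(4, 16, 16))    # K = 4 hypothetical templates

# Two target widths built from the same templates with different scalers.
scalers_small = rng.normal(size=(4, 2, 2))  # 2 x 2 grid of 16 x 16 blocks
scalers_large = rng.normal(size=(4, 6, 6))  # 6 x 6 grid of 16 x 16 blocks

w_small = compose_weight(templates, scalers_small)
w_large = compose_weight(templates, scalers_large)
print(w_small.shape, w_large.shape)  # (32, 32) (96, 96)
```

In this sketch the templates hold 4 x 16 x 16 = 1024 values regardless of the target width, while the size-specific scalers hold only 16 values for the small model and 144 for the large one, which is the sense in which the scalers are lightweight.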
In practice
- Initialize models with varying depths and widths.
- Apply to Transformer-based and Convolution-based architectures.
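The idea of adapting to a new size by learning only the scalers can be illustrated on a toy linear layer. The sketch below freezes a set of templates and fits the scalers for one target width from a small dataset. It uses ordinary least squares as a stand-in for whatever training procedure the original work uses, and the Kronecker-sum composition, the helper names `compose_weight` and `fit_scalers`, and all shapes are assumptions made for illustration.

```python
import numpy as np

def compose_weight(templates, scalers):
    """Weight matrix as sum_k kron(scalers[k], templates[k]) (assumed form)."""
    return sum(np.kron(s, t) for s, t in zip(scalers, templates))

def fit_scalers(templates, grid, x, y):
    """Least-squares fit of the scalers while the templates stay frozen.

    The prediction x @ W.T is linear in the scaler entries, so each entry
    contributes one column to a design matrix.
    """
    k = templates.shape[0]
    g_out, g_in = grid
    n_coef = k * g_out * g_in
    columns = []
    for idx in range(n_coef):
        basis = np.zeros(n_coef)
        basis[idx] = 1.0                      # switch on one scaler entry
        w_basis = compose_weight(templates, basis.reshape(k, g_out, g_in))
        columns.append((x @ w_basis.T).ravel())
    design = np.stack(columns, axis=1)        # (n * d_out, n_coef)
    coef, *_ = np.linalg.lstsq(design, y.ravel(), rcond=None)
    return coef.reshape(k, g_out, g_in)

rng = np.random.default_rng(1)
templates = rng.normal(size=(3, 4, 4))        # frozen, shared templates
true_scalers = rng.normal(size=(3, 2, 2))     # unknown for the target size
w_true = compose_weight(templates, true_scalers)   # 8 x 8 target layer

x = rng.normal(size=(20, 8))                  # a small adaptation set
y = x @ w_true.T

fitted = fit_scalers(templates, (2, 2), x, y)
w_hat = compose_weight(templates, fitted)
print(np.allclose(w_hat, w_true))
```

Only 12 scaler values are estimated here, compared with 64 entries in the full 8 x 8 weight, which is why a small adaptation set can be enough in this toy setting.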
Topics
- Constraint-based Pre-training
- WeiT
- Weight Templates
- Lightweight Weight Scalers
- Variable-sized Model Initialization
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.