ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit
Summary
A new study establishes the convergence of training dynamics for Residual Neural Networks (ResNets) to a joint infinite-scale limit across depth (L), hidden width (M), and embedding dimension (D). The research focuses on ResNets with two-layer perceptron blocks operating in the maximal local feature update (MLU) regime. After a bounded number of training steps, the error between the ResNet and its large-scale limit is proven to be O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). For a parameter budget of P = Theta(L M D), this translates to a convergence rate of O(P^(-1/6)) when (L, M, D) scalings are optimized. This analysis, which leverages the depth-two structure of residual blocks, is applicable to various state-of-the-art architectures, including Transformers with bounded key-query dimensions.
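The summary quotes only the final rate, so the following is a reconstruction of how O(P^(-1/6)) can be read off the three-term bound, not the paper's own derivation. It treats P = L M D exactly, drops all constants, and writes the bound as a function named \varepsilon (a symbol introduced here for illustration).

```latex
\begin{align*}
\varepsilon(L,M,D) &= \frac{1}{L} + \sqrt{\frac{D}{LM}} + \frac{1}{\sqrt{D}},
  \qquad P = LMD \\
\sqrt{\frac{D}{LM}} &= \sqrt{\frac{D^2}{P}} = \frac{D}{\sqrt{P}}
  && \text{(substituting } LM = P/D\text{)} \\
\frac{D}{\sqrt{P}} + \frac{1}{\sqrt{D}} &\ge P^{-1/6}
  && \text{(tight in order when } D = \Theta(P^{1/3})\text{)} \\
\frac{1}{L} &= O(P^{-1/6})
  && \text{(whenever } L = \Omega(P^{1/6})\text{)} \\
L = P^{1/6},\; M = P^{1/2},\; D = P^{1/3}
  &\;\Longrightarrow\; \varepsilon = 3\,P^{-1/6} = \Theta(P^{-1/6})
\end{align*}
```

Under these assumptions the embedding dimension is the tightly constrained quantity: the last two terms pull D in opposite directions and meet at D of order P^(1/3), while depth only needs to be at least of order P^(1/6).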
Key takeaway
For researchers designing or scaling ResNet architectures, the O(P^(-1/6)) rate under optimal (L, M, D) scaling gives a quantitative bound on how far a finite network is from its infinite-scale limit after a bounded number of training steps. This provides a theoretical basis for predicting training behavior and for allocating parameters across depth, hidden width, and embedding dimension, including in architectures such as Transformers with bounded key-query dimension.
Key insights
ResNet training dynamics converge to a quantifiable large-scale limit across depth, width, and embedding dimension.
Principles
- MLU regime enables rigorous convergence analysis.
- Depth-two block structure is key for broad applicability.
Method
The analysis combines the cavity method with propagation-of-chaos arguments on skeleton maps to rigorously quantify convergence to a dynamical mean-field theory (DMFT)-type limit, extending previous work that treated a fixed embedding dimension.
In practice
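- The O(P^(-1/6)) error rate under optimal (L, M, D) scaling guides how a parameter budget is split across depth, hidden width, and embedding dimension.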
- Applies to Transformers with bounded key-query dimension.
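As a rough numerical illustration of the bound quoted in the summary, the sketch below evaluates 1/L + sqrt(D/(L M)) + 1/sqrt(D) for a few network shapes at the same budget. The function names `error_bound` and `shape_for_budget` are hypothetical, all hidden constants are set to 1 (an assumption; the summary gives only the order of the bound), and the allocation L ~ P^(1/6), D ~ P^(1/3) is one choice that attains the stated rate, not necessarily the paper's recommendation.

```python
import math


def error_bound(L: int, M: int, D: int) -> float:
    """Three-term bound from the summary, with every hidden constant set to 1."""
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)


def shape_for_budget(P: int) -> tuple[int, int, int]:
    """One (L, M, D) allocation with L*M*D close to P that attains the P^(-1/6) rate.

    Assumption: depth ~ P^(1/6) and embedding dimension ~ P^(1/3);
    hidden width M takes whatever budget is left.
    """
    L = max(1, round(P ** (1 / 6)))
    D = max(1, round(P ** (1 / 3)))
    M = max(1, P // (L * D))
    return L, M, D


if __name__ == "__main__":
    P = 10**6
    shapes = {
        "rate-attaining": shape_for_budget(P),
        "tiny embedding": (100, 10_000, 1),
        "huge embedding": (10, 10, 10_000),
    }
    for name, (L, M, D) in shapes.items():
        print(f"{name:>15}: L={L}, M={M}, D={D}, bound={error_bound(L, M, D):.3f}")
```

With constants set to 1 the absolute values are not meaningful; only the comparison between shapes at a fixed budget is.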
Topics
- ResNet Training Dynamics
- Large-scale Limits
- Convergence Analysis
- Mean Field Theory
- Transformer Architectures
Best for: AI Researcher, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.