ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit

2026-03-20 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study establishes the convergence of training dynamics for Residual Neural Networks (ResNets) to a joint infinite-scale limit across depth (L), hidden width (M), and embedding dimension (D). The research focuses on ResNets with two-layer perceptron blocks operating in the maximal local feature update (MLU) regime. After a bounded number of training steps, the error between the ResNet and its large-scale limit is proven to be O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). For a parameter budget of P = Theta(L M D), this translates to a convergence rate of O(P^(-1/6)) when (L, M, D) scalings are optimized. This analysis, which leverages the depth-two structure of residual blocks, is applicable to various state-of-the-art architectures, including Transformers with bounded key-query dimensions.
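One way to see where the exponent 1/6 comes from is to balance the terms of the stated bound under a fixed budget. The derivation below is a sketch, not the paper's own argument: it assumes P = LMD exactly (dropping the constants hidden in the Theta) and treats the bound's constant as 1.

```latex
\begin{align*}
P = LMD \;&\Longrightarrow\; \sqrt{\frac{D}{LM}} = \sqrt{\frac{D^2}{P}} = \frac{D}{\sqrt{P}}, \\
\text{error} \;&\lesssim\; \frac{1}{L} + \frac{D}{\sqrt{P}} + \frac{1}{\sqrt{D}}, \\
\frac{D}{\sqrt{P}} = \frac{1}{\sqrt{D}} \;&\Longleftrightarrow\; D = P^{1/3},
  \quad\text{where both terms equal } P^{-1/6}, \\
\frac{1}{L} \le P^{-1/6} \;&\Longleftrightarrow\; L \ge P^{1/6},
  \quad\text{with } M = \frac{P^{2/3}}{L}.
\end{align*}
```

Because D/sqrt(P) increases with D while 1/sqrt(D) decreases, the larger of the two is smallest where they cross, so under these assumptions no allocation makes the bound smaller than order P^(-1/6). The rate is attained by any shape with D of order P^(1/3) and L at least of order P^(1/6).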

Key takeaway

For researchers designing or scaling ResNet architectures, the O(P^(-1/6)) rate under optimal (L, M, D) scaling gives a quantitative bound on the gap between a finite network's training dynamics and its infinite-scale limit. That bound offers a theoretical basis for predicting training behavior and for allocating parameters across depth, width, and embedding dimension, including in architectures such as Transformers with bounded key-query dimension.

Key insights

ResNet training dynamics converge to a joint large-scale limit in depth, width, and embedding dimension, with an explicit bound on the approximation error.

Principles

MLU regime enables rigorous convergence analysis.
Depth-two block structure is key for broad applicability.

Method

The analysis combines the cavity method with propagation of chaos arguments on skeleton maps to rigorously quantify convergence for a DMFT-type limit, extending previous work on fixed embedding dimension.

In practice

Error rate O(P^(-1/6)) guides ResNet scaling.
Applies to Transformers with bounded key-query dimension.
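To make the scaling guidance concrete, the following minimal sketch evaluates the bound from the summary for several ways of splitting a fixed parameter budget. It assumes P = L * M * D exactly and sets the unknown constant in the O(.) to 1, so only relative comparisons are meaningful. The function names `error_bound` and `shape_from_exponents` are hypothetical helpers introduced here, not part of the paper.

```python
import math


def error_bound(L, M, D):
    """Stated bound 1/L + sqrt(D/(L*M)) + 1/sqrt(D), with the constant assumed to be 1."""
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)


def shape_from_exponents(P, a, c):
    """Return (L, M, D) with L = P**a, D = P**c and M chosen so that L*M*D = P.

    Assumption: the budget is exactly P = L*M*D (constants in Theta dropped).
    Values are real-valued; a practical design would round to integers.
    """
    L = P ** a
    D = P ** c
    M = P / (L * D)
    return L, M, D


if __name__ == "__main__":
    P = 1e6
    # Compare three embedding exponents at the same depth exponent.
    for label, c in [("D ~ P^(1/6)", 1 / 6), ("D ~ P^(1/3)", 1 / 3), ("D ~ P^(1/2)", 1 / 2)]:
        L, M, D = shape_from_exponents(P, 1 / 3, c)
        print(f"{label}: L={L:.1f}, M={M:.1f}, D={D:.1f}, bound={error_bound(L, M, D):.4f}")
    print(f"reference P^(-1/6) = {P ** (-1 / 6):.4f}")
```

Under these assumptions, the choice D ~ P^(1/3) gives the smallest bound of the three: a smaller embedding dimension inflates the 1/sqrt(D) term, while a larger one inflates sqrt(D/(L M)).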

Topics

ResNet Training Dynamics
Large-scale Limits
Convergence Analysis
Mean Field Theory
Transformer Architectures

Best for: AI Researcher, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.