Generalization in Nonlinear Least Squares via Learned Feature Geometry
Summary
The paper "Generalization in Nonlinear Least Squares via Learned Feature Geometry" investigates generalization in ridge-regularized nonlinear least-squares models. It derives error bounds for local minimizers using on-average algorithmic stability and introduces a data-dependent effective dimension. This dimension captures the geometry of the gradient model at the trained parameters, combining the empirical Jacobian Gram matrix with a residual-curvature term. Unlike neural tangent kernel analyses, which linearize around initialization, the effective dimension is evaluated at the trained model. The authors further bound this dimension through gradient feature covering complexity, so the guarantees depend on learned geometry rather than parameter count. For manifold-supported data, the bounds scale with intrinsic dimension; for one-hidden-layer ReLU networks, the mechanism involves activation-stable regions. Experiments confirm that the trained Jacobian compresses and that the bounds agree with observed generalization gaps. The derivation proceeds from first principles and uses the Brascamp-Lieb inequality.
Key takeaway
For AI scientists evaluating nonlinear models, this research suggests looking beyond parameter counts when reasoning about generalization guarantees. Consider a data-dependent effective dimension that reflects learned feature geometry, built from the empirical Jacobian Gram matrix and a residual-curvature term evaluated at the trained parameters. The resulting bounds scale with intrinsic dimension for manifold-supported data and, for one-hidden-layer ReLU networks, operate through activation-stable regions.
Key insights
New generalization bounds for nonlinear least-squares models depend on learned feature geometry through a data-dependent effective dimension, rather than on parameter count.
Principles
- Generalization bounds can reflect learned geometry.
- Effective dimension can be data-dependent.
- Algorithmic stability yields error bounds.
Method
Error bounds are derived via on-average algorithmic stability, defining an effective dimension from the empirical Jacobian Gram matrix and a residual-curvature term. This dimension is bounded using gradient feature covering complexity.
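To make the Jacobian Gram matrix concrete, here is a minimal NumPy sketch for a one-hidden-layer ReLU network. It is an illustration under stated assumptions, not the paper's definition: the summary does not give the exact formula, so the sketch uses the standard ridge-type quantity tr(G (G + lam I)^{-1}) with G = J^T J / n and leaves out the residual-curvature term entirely. The function names, the network form f(x) = sum_k a_k relu(w_k . x), the value of `lam`, and the random parameters (standing in for trained ones; no training is performed) are all hypothetical choices made for the example.

```python
import numpy as np


def relu_net_jacobian(X, W, a):
    """Per-example gradients of f(x) = sum_k a_k * relu(w_k . x) w.r.t. (W, a).

    Hypothetical helper for illustration. Returns an array of shape
    (n, m*d + m), one row per example.
    """
    pre = X @ W.T                       # (n, m) pre-activations
    act = np.maximum(pre, 0.0)          # df/da_k = relu(w_k . x)
    mask = (pre > 0).astype(X.dtype)    # activation pattern 1[w_k . x > 0]
    # df/dw_k = a_k * 1[w_k . x > 0] * x, shape (n, m, d)
    dW = (mask * a)[:, :, None] * X[:, None, :]
    return np.concatenate([dW.reshape(len(X), -1), act], axis=1)


def ridge_effective_dimension(J, lam):
    """Assumed ridge-type effective dimension tr(G (G + lam I)^{-1}), G = J^T J / n.

    The nonzero eigenvalues of G are the squared singular values of J
    divided by n; zero eigenvalues contribute nothing to the trace.
    """
    n = J.shape[0]
    s = np.linalg.svd(J, compute_uv=False)
    eig = s**2 / n
    return float(np.sum(eig / (eig + lam)))


# Toy data supported on a 2-dimensional linear subspace of R^5.
rng = np.random.default_rng(0)
n, d, m, intrinsic = 200, 5, 8, 2
X = rng.normal(size=(n, intrinsic)) @ rng.normal(size=(intrinsic, d))

# Random parameters standing in for trained ones (assumption).
W = rng.normal(size=(m, d))
a = rng.normal(size=m)

J = relu_net_jacobian(X, W, a)
d_eff = ridge_effective_dimension(J, lam=1e-2)
print(f"parameters: {J.shape[1]}, effective dimension: {d_eff:.2f}")
```

In this toy setup the network has 48 parameters, but each hidden unit's gradient features lie in a span of dimension at most 2 (the intrinsic dimension of the data), so the rank of J, and hence this effective dimension, is at most 2 * 8 = 16. That is only a linear-subspace analogue of the manifold result described in the summary, intended to show how a Jacobian-based quantity can sit well below the parameter count.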
Topics
- Generalization Theory
- Nonlinear Least Squares
- Algorithmic Stability
- Effective Dimension
- Jacobian Matrix
- ReLU Networks
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.