The Implicit Bias of Logit Regularization
Summary
This work analyzes logit regularization, of which label smoothing is a prominent example, in linear classifiers, and shows that it induces an implicit bias of "logit clustering" around finite per-sample targets. Unregularized training is implicitly biased toward margin maximization; logit regularization instead drives the logits to cluster, which, for Gaussian data or a quadratic per-sample loss, aligns the weight vector with Fisher's Linear Discriminant. The study shows that logit regularization halves the critical sample complexity and induces grokking in the small-noise limit, while making generalization robust to orthogonal noise. These findings extend the theoretical understanding of label smoothing and highlight the efficacy of a broader class of logit-regularization methods. They are validated numerically on Gaussian data and on ResNet-18 embeddings from CIFAR-10.
Key takeaway
Research scientists developing or deploying linear classifiers should consider adding a convex logit regularizer. It can improve generalization, especially in high-dimensional settings where unregularized models might overfit, and it offers robustness against orthogonal noise. They should also anticipate a shift in the interpolation threshold and possible grokking dynamics, both of which can be managed by adjusting the regularization strength.
Key insights
Logit regularization shifts classifier optimization from margin maximization to logit clustering, aligning weights with Fisher's Linear Discriminant.
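The quadratic-loss case of this alignment can be checked numerically with a classical identity: a least-squares fit of a linear score to two-valued targets yields a weight vector parallel to Fisher's Linear Discriminant. The sketch below is a minimal illustration under assumptions that are not from the original article: synthetic correlated Gaussian features, targets of +1 and -1, and a fitted intercept. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 5

# Two classes with labels +1 / -1 and correlated Gaussian features (synthetic).
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
A = rng.normal(size=(d, d))                      # mixing matrix, gives correlated noise
X = y[:, None] * np.ones(d) + rng.normal(size=(n, d)) @ A.T

# Quadratic per-sample loss: least squares pulling each score toward its target.
Xb = np.hstack([X, np.ones((n, 1))])             # append an intercept column
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w_ls = coef[:-1]                                 # drop the intercept

# Fisher's Linear Discriminant: S_w^{-1} (m_pos - m_neg).
m_pos = X[y > 0].mean(axis=0)
m_neg = X[y < 0].mean(axis=0)
Xc = np.where((y > 0)[:, None], X - m_pos, X - m_neg)   # center each class
S_w = Xc.T @ Xc                                  # within-class scatter
w_fld = np.linalg.solve(S_w, m_pos - m_neg)

cosine = w_ls @ w_fld / (np.linalg.norm(w_ls) * np.linalg.norm(w_fld))
print("cosine(w_ls, w_fld) =", cosine)
```

The two directions agree up to floating-point error because the total scatter differs from the within-class scatter only by a rank-one term along the difference of class means. This covers only the quadratic case with an intercept; it does not reproduce the article's Gaussian-data result.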
Principles
- Logit regularization creates finite per-sample loss minima.
- Generalization accuracy is largely insensitive to the regularizer's form.
- Optimal generalization accuracy is invariant to orthogonal noise scale.
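The first principle can be made concrete for binary label smoothing. The derivation below is a sketch under an assumed convention that may differ from the original paper's notation: $z$ is the signed logit of a sample (positive when the sample is classified correctly), $\sigma$ is the logistic sigmoid, and smoothing strength $\varepsilon \in (0, 1)$ places probability $\varepsilon/2$ on the wrong class.

```latex
\begin{align*}
\ell_\varepsilon(z)
  &= \Bigl(1 - \tfrac{\varepsilon}{2}\Bigr)\log\bigl(1 + e^{-z}\bigr)
   + \tfrac{\varepsilon}{2}\,\log\bigl(1 + e^{z}\bigr), \\
\ell_\varepsilon'(z)
  &= \sigma(z) - \Bigl(1 - \tfrac{\varepsilon}{2}\Bigr), \\
\ell_\varepsilon'(z^\star) = 0
  \;&\Longrightarrow\;
  z^\star = \log\frac{2 - \varepsilon}{\varepsilon}.
\end{align*}
```

For $\varepsilon = 0$ the derivative $\sigma(z) - 1$ is negative for every finite $z$, so the per-sample loss has no finite minimizer and the logits can grow without bound. For any $\varepsilon > 0$ the minimizer $z^\star$ is finite, which gives each sample a finite target.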
Method
The method adds a convex penalty directly in logit space. This changes the implicit bias of training from margin maximization to clustering of the logits around finite per-sample targets.
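A minimal sketch of such an objective follows, assuming a quadratic penalty `0.5 * lam * z**2` on the signed logit as the convex regularizer, plain gradient descent, and synthetic signal-plus-noise Gaussian data. The penalty form, the function names `loss_and_grad` and `fit`, and all hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def loss_and_grad(w, X, y, lam):
    """Mean logistic loss plus a quadratic penalty on the signed logits."""
    z = y * (X @ w)                                   # signed logits, shape (n,)
    per_sample = np.logaddexp(0.0, -z) + 0.5 * lam * z**2
    # d/dz of the per-sample loss: -sigmoid(-z) + lam * z
    dz = -0.5 * (1.0 - np.tanh(0.5 * z)) + lam * z
    grad = X.T @ (dz * y) / len(y)
    return per_sample.mean(), grad

def fit(X, y, lam, lr=0.2, steps=3000):
    """Plain gradient descent from zero initialization."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        _, grad = loss_and_grad(w, X, y, lam)
        w -= lr * grad
    return w

# Synthetic data: class mean +/- mu plus isotropic Gaussian noise.
rng = np.random.default_rng(0)
n, d = 200, 20
mu = np.zeros(d)
mu[0] = 2.0
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
X = y[:, None] * mu + rng.normal(size=(n, d))

w = fit(X, y, lam=0.1)
z = y * (X @ w)
final_loss, final_grad = loss_and_grad(w, X, y, lam=0.1)
print("mean signed logit:", z.mean(), "std:", z.std())
print("gradient norm at the solution:", np.linalg.norm(final_grad))
```

With `lam > 0` the objective is strongly convex here, so gradient descent converges to a finite weight vector and the signed logits settle around a finite value. On linearly separable data the unpenalized logistic loss (`lam = 0`) has no finite minimizer, which is the regime associated with margin maximization.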
In practice
- Use logit regularization to improve generalization and calibration.
- Expect grokking dynamics in weakly regularized, high-dimensional settings.
- Logit regularization enhances robustness to orthogonal noise.
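As a practical starting point, label smoothing is one logit regularizer that is simple to implement. The sketch below assumes the common convention in which a fraction `eps` of the target mass is spread uniformly over all `K` classes; the function names `smoothed_cross_entropy`, `target_logit_gap`, and `loss_at` are hypothetical, and the original article does not specify an implementation.

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, eps=0.1):
    """Mean cross-entropy against label-smoothed targets.

    logits: float array of shape (n, K); labels: int array of shape (n,).
    """
    n, K = logits.shape
    targets = np.full((n, K), eps / K)               # uniform mass eps / K everywhere
    targets[np.arange(n), labels] += 1.0 - eps       # remaining mass on the true class
    shifted = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

def target_logit_gap(K, eps):
    """Gap between true-class and other logits at which softmax equals the targets."""
    return np.log((1.0 - eps + eps / K) / (eps / K))

K, eps = 10, 0.1
gap = target_logit_gap(K, eps)
labels = np.array([3])

def loss_at(g):
    """Loss for one sample whose true-class logit exceeds the others by g."""
    logits = np.zeros((1, K))
    logits[0, 3] = g
    return smoothed_cross_entropy(logits, labels, eps)

print("finite target gap:", gap)
print("loss below, at, above the gap:", loss_at(gap - 1), loss_at(gap), loss_at(gap + 1))
```

The loss is smallest at the finite gap and rises on either side of it, so increasing `eps` pulls the target gap toward zero. This gives one concrete handle on regularization strength.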
Topics
- Logit Regularization
- Implicit Bias
- Label Smoothing
- Fisher's Linear Discriminant
- Grokking
Best for: Research Scientist, AI Researcher, AI Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.