The Implicit Bias of Logit Regularization

2026-02-13 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

This work analyzes logit regularization, including label smoothing, in linear classifiers, demonstrating that it induces an implicit bias of "logit clustering" around finite per-sample targets. Unlike unregularized methods that maximize margins, logit regularization drives logits to cluster, which, for Gaussian data or quadratic per-sample loss, aligns the weight vector with Fisher's Linear Discriminant. The study shows that logit regularization halves the critical sample complexity and induces grokking in the small-noise limit, while making generalization robust to orthogonal noise. These findings extend the theoretical understanding of label smoothing and highlight the efficacy of a broader class of logit-regularization methods, validated numerically on Gaussian data and ResNet-18 embeddings from CIFAR-10.

Key takeaway

Research scientists developing or deploying linear classifiers should consider adding a convex logit regularizer. It can improve generalization, especially in high-dimensional settings where unregularized models might overfit, and it makes generalization robust to orthogonal noise. Expect a shift in the interpolation threshold and possible grokking dynamics, both of which can be managed by adjusting the regularization strength.

Key insights

Logit regularization shifts classifier optimization from margin maximization to logit clustering, aligning weights with Fisher's Linear Discriminant.
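The quadratic per-sample loss case can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: it fits logits to ±1 targets by ordinary least squares with an intercept and compares the resulting weight vector with the Fisher direction computed from the sample within-class scatter. The function names, the synthetic anisotropic Gaussian data, and the ±1 target coding are all hypothetical choices made for this example.

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher's Linear Discriminant direction S_W^{-1} (m_pos - m_neg), labels in {-1, +1}."""
    X_pos, X_neg = X[y == 1], X[y == -1]
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter: sum of the centered outer products of each class.
    S_w = (X_pos - m_pos).T @ (X_pos - m_pos) + (X_neg - m_neg).T @ (X_neg - m_neg)
    return np.linalg.solve(S_w, m_pos - m_neg)

def quadratic_logit_fit(X, y):
    """Least-squares fit of the logits w.x + b to the targets y (quadratic per-sample loss)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
    return coef[:-1]  # drop the intercept, keep the weight vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic two-class data (assumed setup): anisotropic Gaussian noise plus a class mean shift.
rng = np.random.default_rng(0)
n, d = 400, 10
y = np.where(rng.random(n) < 0.5, 1, -1)
mu = rng.normal(size=d)
X = rng.normal(size=(n, d)) * np.linspace(0.5, 2.0, d) + 0.5 * y[:, None] * mu

w_ls = quadratic_logit_fit(X, y)
w_fld = fisher_direction(X, y)
alignment = cosine(w_ls, w_fld)
print(alignment)  # cosine similarity between the two directions
```

For two classes, the least-squares weight vector is a positive multiple of the sample Fisher direction, so the cosine similarity should be 1 up to floating-point error. This only illustrates the quadratic case named in the summary; it says nothing about other regularizers.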

Principles

Logit regularization creates finite per-sample loss minima.
Generalization accuracy is largely insensitive to the regularizer's form.
Optimal generalization accuracy is invariant to orthogonal noise scale.
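The first principle can be made concrete for binary label smoothing. The derivation below is a worked example under an assumed convention (smoothed target $1-\varepsilon$ for the true class, $\varepsilon \in (0, 1/2)$, signed logit $z = y\,w^\top x$ with $y \in \{-1, +1\}$); the notation is introduced here for illustration and may differ from the paper's.

```latex
\[
\ell_\varepsilon(z) = (1-\varepsilon)\,\log\!\left(1 + e^{-z}\right) + \varepsilon\,\log\!\left(1 + e^{z}\right),
\qquad
\ell_\varepsilon'(z) = -(1-\varepsilon)\,\sigma(-z) + \varepsilon\,\sigma(z),
\]
\[
\ell_\varepsilon'(z^\ast) = 0
\;\Longleftrightarrow\;
\sigma(z^\ast) = 1 - \varepsilon
\;\Longleftrightarrow\;
z^\ast = \log\frac{1-\varepsilon}{\varepsilon},
\]
```

where $\sigma(z) = 1/(1+e^{-z})$. The minimizer $z^\ast$ is finite for every $\varepsilon > 0$ and diverges as $\varepsilon \to 0$, which recovers the unregularized logistic loss whose per-sample minimum is only approached at infinite margin.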

Method

The method adds a convex penalty directly in logit space. This changes the implicit bias of training from margin maximization to clustering the logits around finite per-sample targets.
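A minimal sketch of this idea, assuming a quadratic penalty on the signed logit as one illustrative convex choice (the summary does not specify the paper's regularizer family). It trains a linear classifier by plain gradient descent in an overparameterized setting (more dimensions than samples) and measures how far the trained logits sit from the finite per-sample minimizer. All names, the penalty strength `lam`, and the synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def per_sample_grad(z, lam):
    """d/dz of log(1 + exp(-z)) + (lam / 2) * z**2, where z = y * w.x is the signed logit."""
    return -sigmoid(-z) + lam * z

def finite_target(lam, lo=0.0, hi=50.0, iters=200):
    """Bisection for the finite minimizer z* of the per-sample loss (root of its gradient)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if per_sample_grad(mid, lam) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit(X, y, lam, lr=1.0, steps=3000):
    """Plain gradient descent on the summed regularized loss of a linear classifier."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = y * (X @ w)
        # Chain rule: dL/dw = sum_i g(z_i) * y_i * x_i
        w -= lr * X.T @ (y * per_sample_grad(z, lam))
    return w

# Assumed toy setup: d > n, so the logits can all reach the per-sample minimizer.
rng = np.random.default_rng(0)
n, d, lam = 20, 200, 0.1
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)

w = fit(X, y, lam)
signed_logits = y * (X @ w)
z_star = finite_target(lam)
spread = float(np.max(np.abs(signed_logits - z_star)))
print(z_star, spread)  # finite target and worst-case distance of any logit from it
```

In this regime every signed logit should converge to the same finite value `z_star` rather than growing without bound, which is the clustering behavior described above. With `lam = 0` the per-sample loss has no finite minimizer and the same loop would instead keep increasing the margins.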

In practice

Use logit regularization to improve generalization and calibration.
Expect grokking dynamics in weakly regularized, high-dimensional settings.
Logit regularization enhances robustness to orthogonal noise.

Topics

Logit Regularization
Implicit Bias
Label Smoothing
Fisher's Linear Discriminant
Grokking

Best for: Research Scientist, AI Researcher, AI Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.