Why Gradient Descent Became Stochastic

2026-05-29 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mathematics & Computational Sciences · Depth: Intermediate, long

Summary

The article derives the parameters of a linear regression model in two ways, with the normal equation and with gradient descent, and then introduces stochastic gradient descent. It starts from simple linear regression, deriving formulas for the slope and intercept, then generalizes to multiple features in matrix notation to obtain the normal equation. It highlights how expensive the normal equation becomes on large datasets because of the matrix inversion it requires. Gradient descent is presented as an iterative alternative, with an explanation of its update rule and the critical role of the learning rate. Finally, Stochastic Gradient Descent (SGD) is introduced for very large datasets: it updates the parameters from a single observation at a time, in contrast to batch gradient descent, which uses the whole dataset for each update. Mini-batch gradient descent is mentioned as well.

Key takeaway

Machine learning engineers fitting linear regression models should prefer iterative methods such as Gradient Descent or Stochastic Gradient Descent on large datasets. The Normal Equation gives a closed-form solution, but the matrix inversion it requires becomes computationally prohibitive with millions of data points or thousands of features. Tune the learning rate carefully: too small and convergence is slow, too large and the updates overshoot the optimal parameters. Iterative optimization matters even more in deep learning, where closed-form solutions are rare.

Key insights

Gradient Descent and its variants offer scalable alternatives to the computationally intensive Normal Equation for optimizing linear regression on large datasets.
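The variants differ mainly in how many observations feed each update. The sketch below is a minimal illustration, not the article's code: the function name `minibatch_gd`, the toy data, and the values of `alpha`, `epochs`, and `seed` are all assumptions chosen for the example. A batch size of 1 gives SGD, the full dataset size gives batch gradient descent, and anything in between is mini-batch.

```python
import random

def minibatch_gd(xs, ys, batch_size, alpha=0.02, epochs=2000, seed=0):
    """Fit y = slope * x + intercept by minimizing MSE on shuffled batches."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    slope, intercept = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # visit observations in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            residuals = [(slope * xs[i] + intercept) - ys[i] for i in batch]
            # Gradient of the batch MSE with respect to slope and intercept
            grad_slope = (2.0 / len(batch)) * sum(r * xs[i] for r, i in zip(residuals, batch))
            grad_intercept = (2.0 / len(batch)) * sum(residuals)
            slope -= alpha * grad_slope
            intercept -= alpha * grad_intercept
    return slope, intercept

# Hypothetical toy data generated from y = 2x + 1, with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

sgd_fit = minibatch_gd(xs, ys, batch_size=1)          # stochastic
mini_fit = minibatch_gd(xs, ys, batch_size=2)         # mini-batch
batch_fit = minibatch_gd(xs, ys, batch_size=len(xs))  # batch
```

Because this toy data is noise-free, all three settings settle on the same line. On noisy data, SGD with a fixed learning rate would typically keep fluctuating around the optimum rather than landing on it exactly.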

Principles

The Normal Equation is exact but costly for large datasets due to matrix inversion.
Gradient Descent iteratively minimizes loss by moving opposite to the gradient.
Learning rate critically impacts Gradient Descent's convergence speed and stability.
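For a single feature, the closed-form solution reduces to the familiar slope and intercept formulas, which need no matrix inversion at all. The sketch below shows that special case only; the function name `fit_closed_form` and the toy data are assumptions made for illustration and do not come from the article.

```python
def fit_closed_form(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data generated from y = 2x + 1, with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_closed_form(xs, ys)
```

With many features the same idea requires solving a linear system in matrix form, which is where the cost noted above comes from.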

Method

Gradient Descent updates the model parameters β iteratively with the rule β := β - α · ∂MSE/∂β, where α is the learning rate and ∂MSE/∂β is the gradient of the mean squared error (MSE) loss with respect to β.
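The update rule can be sketched for the one-feature case, where β consists of just a slope and an intercept. This is a minimal illustration under assumed settings: the function names, the toy data, the starting point of zero, and the values `alpha=0.05` and `steps=2000` are choices made for the example, not taken from the article.

```python
def mse_gradient(xs, ys, slope, intercept):
    """Gradient of MSE = mean((slope * x + intercept - y)^2)."""
    n = len(xs)
    residuals = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
    grad_slope = (2.0 / n) * sum(r * x for r, x in zip(residuals, xs))
    grad_intercept = (2.0 / n) * sum(residuals)
    return grad_slope, grad_intercept

def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Repeat beta := beta - alpha * gradient over the full dataset."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        grad_slope, grad_intercept = mse_gradient(xs, ys, slope, intercept)
        slope -= alpha * grad_slope          # step against the gradient
        intercept -= alpha * grad_intercept
    return slope, intercept

# Hypothetical toy data generated from y = 2x + 1, with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = gradient_descent(xs, ys)
```

Every step here uses all observations to compute the gradient, which is what makes this the batch form of the method.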

In practice

Use the Normal Equation for small to medium datasets.
Employ Gradient Descent for large datasets to avoid matrix inversion.
Adjust learning rate to prevent slow convergence or overshooting.
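The effect of the learning rate is easy to see on a small example. The sketch below runs the same number of updates with three different rates and compares the resulting loss; the function name `mse_after`, the toy data, and the three rates are assumptions picked for this dataset, and which values count as too small or too large will differ on other data.

```python
def mse_after(xs, ys, alpha, steps):
    """Run `steps` gradient descent updates from zero and return the final MSE."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
        # Update both parameters together from the same residuals
        slope, intercept = (
            slope - alpha * (2.0 / n) * sum(r * x for r, x in zip(residuals, xs)),
            intercept - alpha * (2.0 / n) * sum(residuals),
        )
    return sum(((slope * x + intercept) - y) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical toy data generated from y = 2x + 1, with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

start = mse_after(xs, ys, alpha=0.05, steps=0)        # loss before any update
too_small = mse_after(xs, ys, alpha=0.001, steps=50)  # improves, but slowly
reasonable = mse_after(xs, ys, alpha=0.05, steps=50)  # close to the minimum
too_large = mse_after(xs, ys, alpha=0.2, steps=50)    # overshoots and diverges
```

On this data the smallest rate lowers the loss only partway in 50 steps, the middle rate gets close to zero, and the largest rate ends with a loss far above where it started.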

Topics

Gradient Descent
Stochastic Gradient Descent
Linear Regression
Normal Equation
Optimization Algorithms
Machine Learning Math

Best for: AI Student, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.