Why Gradient Descent Became Stochastic
Summary
The article walks through the mathematical derivation of linear regression parameters using both the normal equation and gradient descent, then introduces stochastic gradient descent. It begins with a simple linear regression example, deriving the slope and intercept formulas, then generalizes to multiple features using matrix notation to derive the normal equation. It highlights how expensive the normal equation becomes on large datasets because of matrix inversion. Gradient descent is presented as an iterative alternative, with its update rule and the critical role of the learning rate explained. Finally, Stochastic Gradient Descent (SGD) is introduced for very large datasets: it updates the parameters from a single observation at a time, in contrast to batch gradient descent, which uses the full dataset for each update. Mini-batch gradient descent is also mentioned.
Key takeaway
Machine Learning Engineers optimizing linear regression models should prefer iterative methods such as Gradient Descent or Stochastic Gradient Descent when working with large datasets. The Normal Equation provides a closed-form solution, but its matrix inversion becomes computationally prohibitive with millions of data points or thousands of features. Tune the learning rate carefully so that training converges efficiently without overshooting the optimal parameters. This matters even more in deep learning, where closed-form solutions are rare.
Key insights
Gradient Descent and its variants offer scalable alternatives to the computationally intensive Normal Equation for optimizing linear regression on large datasets.
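As an illustration of one such variant, the following is a minimal sketch of stochastic gradient descent for linear regression, where each update uses a single observation. It assumes NumPy is available; the function name `sgd`, the toy data, and the default values for `alpha`, `n_epochs`, and `seed` are hypothetical choices made for this example, not taken from the original article.

```python
import numpy as np

def sgd(X, y, alpha=0.02, n_epochs=500, seed=0):
    """Stochastic gradient descent on squared error, one observation per update."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_epochs):
        # Visit the observations in a fresh random order each epoch.
        for i in rng.permutation(n_samples):
            error = X[i] @ beta - y[i]          # residual for this single row
            grad = 2.0 * error * X[i]           # gradient of (x_i . beta - y_i)^2
            beta = beta - alpha * grad
    return beta

# Hypothetical noise-free toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])       # intercept column + feature
y = 1.0 + 2.0 * x

beta_sgd = sgd(X, y)
print(beta_sgd)
```

Because this toy data has no noise, the single-observation updates settle on the generating parameters. With noisy data the estimates would keep fluctuating around the optimum unless the learning rate is reduced over time.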
Principles
- The Normal Equation is exact but costly for large datasets due to matrix inversion.
- Gradient Descent iteratively minimizes loss by moving opposite to the gradient.
- Learning rate critically impacts Gradient Descent's convergence speed and stability.
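To make the first principle concrete, here is a minimal sketch of the Normal Equation, β = (XᵀX)⁻¹Xᵀy, assuming NumPy. The helper name `fit_normal_equation` and the toy data are hypothetical. The sketch solves the linear system rather than forming the inverse explicitly, which is a common numerical choice and not necessarily how the original article implements it.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return beta satisfying (X^T X) beta = X^T y.

    X is expected to include a leading column of ones for the intercept.
    """
    # Solving the system avoids an explicit inverse, but the work still
    # grows quickly as the number of features increases.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = 1.0 + 2.0 * x

beta_exact = fit_normal_equation(X, y)
print(beta_exact)
```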
Method
Gradient Descent iteratively updates model parameters β using β := β - α∂MSE/∂β, where α is the learning rate and ∂MSE/∂β is the loss function's gradient.
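The update rule above can be sketched in a few lines. This is a minimal batch gradient descent example, assuming NumPy and taking MSE = (1/n)‖Xβ − y‖², whose gradient is (2/n)Xᵀ(Xβ − y). The function name `gradient_descent`, the toy data, and the defaults for `alpha` and `n_iters` are hypothetical.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on MSE: beta := beta - alpha * dMSE/dbeta."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iters):
        residual = X @ beta - y                       # uses the full dataset
        grad = (2.0 / n_samples) * (X.T @ residual)   # gradient of the MSE
        beta = beta - alpha * grad                    # step against the gradient
    return beta

# Hypothetical noise-free toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = 1.0 + 2.0 * x

beta_gd = gradient_descent(X, y)
print(beta_gd)
```

Every iteration here touches all rows of X, which is the cost that single-observation and mini-batch variants are meant to reduce.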
In practice
- Use the Normal Equation for small to medium datasets.
- Employ Gradient Descent for large datasets to avoid matrix inversion.
- Adjust learning rate to prevent slow convergence or overshooting.
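The effect of the learning rate can be seen on a deliberately simple problem. The sketch below minimizes the one-dimensional loss (b − 3)², whose gradient is 2(b − 3), with three learning rates. The function `run_gd`, the target value 3.0, and the specific rates are hypothetical values chosen only to show slow convergence, fast convergence, and overshooting.

```python
def run_gd(alpha, n_iters=50, start=0.0, target=3.0):
    """Run gradient descent on the loss (b - target)^2 and return the final b."""
    b = start
    for _ in range(n_iters):
        grad = 2.0 * (b - target)   # derivative of (b - target)^2
        b -= alpha * grad
    return b

# Each step multiplies the error by (1 - 2 * alpha):
# a tiny alpha barely moves, a moderate alpha converges quickly,
# and an alpha above 1 makes the error grow with every step.
for alpha in (0.001, 0.4, 1.1):
    final_b = run_gd(alpha)
    print(f"alpha={alpha}: |error| = {abs(final_b - 3.0):.3g}")
```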
Topics
- Gradient Descent
- Stochastic Gradient Descent
- Linear Regression
- Normal Equation
- Optimization Algorithms
- Machine Learning Math
Best for: AI Student, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.