How I Finally Understood Backpropagation by Deriving It by Hand
Summary
The article walks through a complete hand derivation of backpropagation for a small neural network, emphasizing the underlying calculus over memorized formulas. It computes ∂E/∂w for each weight, starting with the output layer. For ∂E/∂w₂, the chain rule splits into three factors: ∂E/∂ŷ = −(y − ŷ), ∂ŷ/∂z₂ = ŷ(1 − ŷ), and ∂z₂/∂w₂ = a₁, which multiply to give ∂E/∂w₂ = −(y − ŷ) · ŷ(1 − ŷ) · a₁. (These factors are what a squared-error loss E = ½(y − ŷ)² with sigmoid activations would produce.) The derivation then moves to the hidden layer, where ∂E/∂w₁ follows the longer dependency chain w₁ → z₁ → a₁ → z₂ → ŷ → E and picks up the additional factors ∂z₂/∂a₁ = w₂ and ∂a₁/∂z₁ = a₁(1 − a₁). The article defines the output and hidden deltas, δ₂ and δ₁, to simplify the gradient expressions, and concludes that backpropagation is fundamentally the chain rule applied to a neural network's computational graph.
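The full chain for the hidden-layer weight can be written out as a worked equation. This is a sketch under assumptions the summary does not state explicitly: the network input is called x (so that ∂z₁/∂w₁ = x), biases are omitted, and each delta is taken to be the derivative of E with respect to that layer's pre-activation. The original article may use a different sign or naming convention for δ₂ and δ₁.

```latex
% Assumed model: z_1 = w_1 x,  a_1 = \sigma(z_1),  z_2 = w_2 a_1,
%                \hat{y} = \sigma(z_2),  E = \tfrac{1}{2}(y - \hat{y})^2
\begin{align*}
\frac{\partial E}{\partial w_1}
  &= \frac{\partial E}{\partial \hat{y}}
     \cdot \frac{\partial \hat{y}}{\partial z_2}
     \cdot \frac{\partial z_2}{\partial a_1}
     \cdot \frac{\partial a_1}{\partial z_1}
     \cdot \frac{\partial z_1}{\partial w_1} \\
  &= -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot w_2 \cdot a_1(1 - a_1) \cdot x .
\end{align*}

% Deltas (assumed convention: \delta_k = \partial E / \partial z_k)
\begin{align*}
\delta_2 &= -(y - \hat{y})\,\hat{y}(1 - \hat{y}), &
\frac{\partial E}{\partial w_2} &= \delta_2\, a_1, \\
\delta_1 &= \delta_2\, w_2\, a_1(1 - a_1), &
\frac{\partial E}{\partial w_1} &= \delta_1\, x .
\end{align*}
```

Written this way, the first three factors of ∂E/∂w₁ reappear inside δ₁ as δ₂ · w₂, which is the reuse of already-computed terms that gives backpropagation its name.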
Key takeaway
For machine learning engineers and AI students struggling with the mechanics of backpropagation, deriving it by hand is worth the effort. The process clarifies how the chain rule computes ∂E/∂w for each weight, reveals the geometric meaning of gradients, and makes implementations easier to debug. Invest time in this foundational exercise; it is a high-return way to understand neural network behavior.
Key insights
Deriving backpropagation by hand provides a fundamental understanding of neural network gradient computation.
Principles
- Backpropagation applies the chain rule.
- Each gradient ∂E/∂w measures how the error changes as that weight changes.
- Deriving equations clarifies network behavior.
Method
The article details a step-by-step derivation of backpropagation for a small neural network, computing ∂E/∂w for output and hidden layers using the chain rule and defining delta terms.
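The derivation can be checked numerically with a few lines of Python. The sketch below is not the article's code: it assumes a one-input, one-hidden-unit, one-output network with sigmoid activations, a squared-error loss E = ½(y − ŷ)², and no biases, and the function names (`forward`, `gradients`, `numeric_gradients`) are hypothetical. It computes the gradients from the delta terms and compares them with central finite differences.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Forward pass: x -> z1 -> a1 -> z2 -> y_hat (no biases, by assumption)."""
    a1 = sigmoid(w1 * x)
    y_hat = sigmoid(w2 * a1)
    return a1, y_hat


def loss(x, y, w1, w2):
    """Squared-error loss E = 0.5 * (y - y_hat)^2 (assumed, not stated in the summary)."""
    _, y_hat = forward(x, w1, w2)
    return 0.5 * (y - y_hat) ** 2


def gradients(x, y, w1, w2):
    """Hand-derived gradients (dE/dw1, dE/dw2) expressed through the deltas."""
    a1, y_hat = forward(x, w1, w2)
    delta2 = -(y - y_hat) * y_hat * (1.0 - y_hat)  # dE/dz2
    delta1 = delta2 * w2 * a1 * (1.0 - a1)         # dE/dz1
    return delta1 * x, delta2 * a1


def numeric_gradients(x, y, w1, w2, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    g1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)
    g2 = (loss(x, y, w1, w2 + eps) - loss(x, y, w1, w2 - eps)) / (2 * eps)
    return g1, g2


if __name__ == "__main__":
    # Arbitrary illustrative values; any small numbers work.
    x, y, w1, w2 = 0.5, 1.0, 0.3, -0.8
    print("analytic:", gradients(x, y, w1, w2))
    print("numeric: ", numeric_gradients(x, y, w1, w2))
```

If the two pairs of numbers disagree beyond small floating-point noise, one of the chain-rule factors is wrong. The same comparison is a practical way to debug a hand-written backward pass in a larger network.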
In practice
- Derive backpropagation for a small network.
- Work through equations by hand.
- Analyze dependency chains for gradients.
Topics
- Backpropagation
- Neural Networks
- Gradient Descent
- Chain Rule
- Machine Learning Education
- Calculus for AI
Best for: AI Student, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.