💡 Mathematical Insight
Using the chain rule and vectorization, backprop does this efficiently:
- Gradients of the cost w.r.t. the output layer are calculated.
- Then, errors are propagated backwards, layer by layer, using matrix operations:
  δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
- Weight gradients are just:
  ∂C/∂W^l = δ^l (a^{l-1})^T
Why do you think the errors (δ) are defined in terms of ∂C/∂z rather than ∂C/∂a (activations)?
Think from a chain rule and gradient-flow perspective. How does that simplify computations?
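One answer: because δ^l = ∂C/∂z^l, moving the error back one layer costs only a matrix multiply plus an elementwise product. A minimal NumPy sketch of that recursion (the layer sizes and random values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: layer l has 4 neurons, layer l+1 has 3.
W_next = rng.normal(size=(3, 4))      # W^{l+1}, maps layer l -> layer l+1
delta_next = rng.normal(size=(3, 1))  # delta^{l+1} = dC/dz^{l+1}
z_l = rng.normal(size=(4, 1))         # weighted inputs z^l at layer l

def sigmoid_prime(z):
    # derivative of the sigmoid, sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l)
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)  # (4, 1)
```

Had δ been defined via ∂C/∂a instead, each layer would still need the extra σ′ factor folded in before crossing the weights, so defining it at z bundles that local derivative once per layer.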
🎯 Goal
Minimize the cost function by adjusting weights and biases using gradient descent.
🧪 Given
- Training examples: x (with desired outputs y)
- Learning rate: η
- Cost function: C
- Activation function: σ(z)
- Derivative: σ′(z)
- Total layers: L
- Mini-batch size: m
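Assuming the common sigmoid choice for the activation (an assumption here, not stated in the notes), σ and σ′ are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)
```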
🧮 Step-by-Step (Per Example x)
1. 🔄 Feedforward
For each layer l = 2, 3, …, L:
  z^l = W^l a^{l-1} + b^l,  a^l = σ(z^l)
2. 🎯 Output Error (Final Layer)
δ^L = ∇_a C ⊙ σ′(z^L)
For quadratic cost: ∇_a C = (a^L − y)
So, δ^L = (a^L − y) ⊙ σ′(z^L)
3. 🔙 Backpropagate Error
For l = L−1, L−2, …, 2:
  δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
4. 🧷 Gradient Descent Update
For each layer l, averaging over the m examples x in the mini-batch:
- Weights: W^l → W^l − (η/m) Σ_x δ^{x,l} (a^{x,l−1})^T
- Biases: b^l → b^l − (η/m) Σ_x δ^{x,l}
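The four steps above can be sketched for a single training example. This is a minimal illustration, assuming sigmoid activations and quadratic cost (both assumptions here), not a full implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(weights, biases, x, y, eta):
    """Feedforward, backpropagate, and update once for a single example."""
    # 1. Feedforward: store z^l and a^l for every layer.
    a = x
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # 2. Output error (quadratic cost): delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # 3-4. Backpropagate the error and update, from the last layer to the first.
    for l in range(len(weights) - 1, -1, -1):
        grad_W = delta @ activations[l].T   # dC/dW^l
        grad_b = delta                      # dC/db^l
        if l > 0:
            # compute the previous layer's delta BEFORE updating weights[l]
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
        weights[l] -= eta * grad_W
        biases[l] -= eta * grad_b
    return weights, biases
```

Note the ordering: each layer's δ must be pushed backward through the pre-update weights before that layer's weights are overwritten.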
What are the 4 fundamental equations?
(BP1) Error at the output layer
δ^L = ∇_a C ⊙ σ′(z^L)
- δ^L: error vector at the output layer
- ∇_a C: how cost changes w.r.t. output activations
- σ′(z^L): slope of the activation at the weighted input
- Meaning: error = "how wrong the output was" × "how sensitive the neuron is"
(BP2) Error at hidden layers
δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
- (W^{l+1})^T δ^{l+1}: pushes the error backward through the weights
- ⊙ σ′(z^l): multiplies by the local derivative
- Meaning: hidden neurons inherit error signals from later layers, scaled by their influence
(BP3) Gradient w.r.t. biases
∂C/∂b^l_j = δ^l_j
- Each bias learns directly from its neuron's error
- Meaning: the bias gradient is simply the error itself
(BP4) Gradient w.r.t. weights
∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j
- Weight gradient = input activation × output error
- Meaning: a connection strengthens/weakens in proportion to how active the input was and how wrong the output turned out
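BP3 and BP4 for a whole layer reduce to one outer product. A small numeric sketch (the values and layer sizes are invented):

```python
import numpy as np

# Hypothetical sizes: 3 neurons in layer l, 4 in layer l-1.
delta = np.array([[0.1], [-0.2], [0.3]])         # delta^l
a_prev = np.array([[1.0], [0.0], [0.5], [2.0]])  # a^{l-1}

grad_b = delta              # BP3: dC/db^l_j = delta^l_j
grad_W = delta @ a_prev.T   # BP4: dC/dw^l_jk = a^{l-1}_k * delta^l_j

print(grad_W.shape)  # (3, 4): one gradient entry per weight
# Column 1 of grad_W is all zeros because a^{l-1}_2 = 0:
# an inactive input produces no weight gradient at all.
```

The zero column makes the "how active the input was" part of BP4 concrete: weights from a silent input neuron receive no update.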
🔁 Putting it together
- Compute output error (BP1).
- Propagate error backward (BP2).
- Use deltas to compute gradients for biases (BP3) and weights (BP4).
- Update parameters with gradient descent.
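The full pipeline, including mini-batch averaging, can be sketched as follows. This assumes sigmoid activations and quadratic cost, and the function names are my own, not from a particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gradients(weights, biases, x, y):
    """BP1-BP4 for one example: returns per-layer (grad_W, grad_b)."""
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):                         # feedforward
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])     # BP1
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        grads.append((delta @ activations[l].T, delta))       # BP4, BP3
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])  # BP2
    return grads[::-1]

def update_mini_batch(weights, biases, batch, eta):
    """Sum the per-example gradients, then take one averaged descent step."""
    m = len(batch)
    sums = None
    for x, y in batch:
        g = gradients(weights, biases, x, y)
        if sums is None:
            sums = g
        else:
            sums = [(sW + gW, sb + gb) for (sW, sb), (gW, gb) in zip(sums, g)]
    for l, (gW, gb) in enumerate(sums):
        weights[l] -= (eta / m) * gW
        biases[l] -= (eta / m) * gb
```

Separating gradient computation from the parameter update keeps the (η/m) averaging in one place, matching the mini-batch update rule in the step-by-step section.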