💡 Mathematical Insight
Using the chain rule and vectorization, backprop does this efficiently:
- Gradients of the cost w.r.t. the output layer are calculated.
- Then, errors are propagated backwards, layer by layer, using matrix operations:
  δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
- Weight gradients are just:
  ∂C/∂W^l = δ^l (a^{l-1})^T
Why do you think the errors (δ) are defined in terms of ∂C/∂z rather than ∂C/∂a (activations)?
Think from a chain rule and gradient-flow perspective. How does that simplify computations?
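One answer: because δ^l = ∂C/∂z^l, moving the error back one layer costs only a matrix multiply plus an elementwise product. A minimal NumPy sketch of that recursion (the layer sizes and random values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: layer l has 4 neurons, layer l+1 has 3.
W_next = rng.normal(size=(3, 4))      # W^{l+1}, maps layer l -> layer l+1
delta_next = rng.normal(size=(3, 1))  # delta^{l+1} = dC/dz^{l+1}
z_l = rng.normal(size=(4, 1))         # weighted inputs z^l at layer l

def sigmoid_prime(z):
    # derivative of the sigmoid, sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l)
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l.shape)  # (4, 1)
```

Had δ been defined via ∂C/∂a instead, each layer would still need the extra σ′ factor folded in before crossing the weights, so defining it at z bundles that local derivative once per layer.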
🎯 Goal
Minimize the cost function by adjusting weights and biases using gradient descent.
🧪 Given
- Training examples: x (with desired outputs y)
- Learning rate: η
- Cost function: C
- Activation function: σ(z)
- Derivative: σ′(z)
- Total layers: L
- Mini-batch size: m
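Assuming the common sigmoid choice for the activation (an assumption here, not stated in the notes), σ and σ′ are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)
```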
🧮 Step-by-Step (Per Example x)
1. 🔄 Feedforward
For each layer l = 2, 3, …, L:
  z^l = W^l a^{l-1} + b^l,  a^l = σ(z^l)
2. 🎯 Output Error (Final Layer)
δ^L = ∇_a C ⊙ σ′(z^L)
For quadratic cost: ∇_a C = (a^L − y)
So, δ^L = (a^L − y) ⊙ σ′(z^L)
3. 🔙 Backpropagate Error
For l = L−1, L−2, …, 2:
  δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
4. 🧷 Gradient Descent Update
For each layer l, averaging over the m examples x in the mini-batch:
- Weights: W^l → W^l − (η/m) Σ_x δ^{x,l} (a^{x,l−1})^T
- Biases: b^l → b^l − (η/m) Σ_x δ^{x,l}
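The four steps above can be sketched for a single training example. This is a minimal illustration, assuming sigmoid activations and quadratic cost (both assumptions here), not a full implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(weights, biases, x, y, eta):
    """Feedforward, backpropagate, and update once for a single example."""
    # 1. Feedforward: store z^l and a^l for every layer.
    a = x
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # 2. Output error (quadratic cost): delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # 3-4. Backpropagate the error and update, from the last layer to the first.
    for l in range(len(weights) - 1, -1, -1):
        grad_W = delta @ activations[l].T   # dC/dW^l
        grad_b = delta                      # dC/db^l
        if l > 0:
            # compute the previous layer's delta BEFORE updating weights[l]
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
        weights[l] -= eta * grad_W
        biases[l] -= eta * grad_b
    return weights, biases
```

Note the ordering: each layer's δ must be pushed backward through the pre-update weights before that layer's weights are overwritten.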
What are the 4 fundamental equations?
(BP1) Error at the output layer
δ^L = ∇_a C ⊙ σ′(z^L)
- δ^L: error vector at the output layer
- ∇_a C: how cost changes w.r.t. output activations
- σ′(z^L): slope of the activation at the weighted input
- Meaning: error = "how wrong the output was" × "how sensitive the neuron is"
(BP2) Error at hidden layers
δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)
- (W^{l+1})^T δ^{l+1}: pushes the error backward through the weights
- ⊙ σ′(z^l): multiplies by the local derivative
- Meaning: hidden neurons inherit error signals from later layers, scaled by their influence
(BP3) Gradient w.r.t. biases
∂C/∂b^l_j = δ^l_j
- Each bias learns directly from its neuron's error
- Meaning: the bias gradient is simply the error itself
(BP4) Gradient w.r.t. weights
∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j
- Weight gradient = input activation × output error
- Meaning: a connection strengthens/weakens in proportion to how active the input was and how wrong the output turned out
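BP3 and BP4 for a whole layer reduce to one outer product. A small numeric sketch (the values and layer sizes are invented):

```python
import numpy as np

# Hypothetical sizes: 3 neurons in layer l, 4 in layer l-1.
delta = np.array([[0.1], [-0.2], [0.3]])         # delta^l
a_prev = np.array([[1.0], [0.0], [0.5], [2.0]])  # a^{l-1}

grad_b = delta              # BP3: dC/db^l_j = delta^l_j
grad_W = delta @ a_prev.T   # BP4: dC/dw^l_jk = a^{l-1}_k * delta^l_j

print(grad_W.shape)  # (3, 4): one gradient entry per weight
# Column 1 of grad_W is all zeros because a^{l-1}_2 = 0:
# an inactive input produces no weight gradient at all.
```

The zero column makes the "how active the input was" part of BP4 concrete: weights from a silent input neuron receive no update.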
🔁 Putting it together
- Compute output error (BP1).
- Propagate error backward (BP2).
- Use deltas to compute gradients for biases (BP3) and weights (BP4).
- Update parameters with gradient descent.
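The full pipeline, including mini-batch averaging, can be sketched as follows. This assumes sigmoid activations and quadratic cost, and the function names are my own, not from a particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gradients(weights, biases, x, y):
    """BP1-BP4 for one example: returns per-layer (grad_W, grad_b)."""
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):                         # feedforward
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])     # BP1
    grads = []
    for l in range(len(weights) - 1, -1, -1):
        grads.append((delta @ activations[l].T, delta))       # BP4, BP3
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])  # BP2
    return grads[::-1]

def update_mini_batch(weights, biases, batch, eta):
    """Sum the per-example gradients, then take one averaged descent step."""
    m = len(batch)
    sums = None
    for x, y in batch:
        g = gradients(weights, biases, x, y)
        if sums is None:
            sums = g
        else:
            sums = [(sW + gW, sb + gb) for (sW, sb), (gW, gb) in zip(sums, g)]
    for l, (gW, gb) in enumerate(sums):
        weights[l] -= (eta / m) * gW
        biases[l] -= (eta / m) * gb
```

Separating gradient computation from the parameter update keeps the (η/m) averaging in one place, matching the mini-batch update rule in the step-by-step section.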