🏎️ Why is Backpropagation “Fast”?
Prior to backpropagation, one could compute gradients via finite differences, i.e., slightly change each weight, rerun the network, and see how the output changes. But that’s computationally expensive:
- For n weights, you'd need to run the entire network roughly n+1 times (one baseline pass, plus one pass per weight) just to estimate the gradient.
Backpropagation avoids that by using the chain rule of calculus to reuse intermediate computations — essentially doing a single backward pass that gives all gradients simultaneously.
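To make that cost concrete, here is a minimal numpy sketch (my own illustration, not from the source) of the finite-difference approach: estimating the gradient takes one extra cost evaluation per weight, so the work grows with the number of parameters.

```python
import numpy as np

def finite_difference_grad(cost_fn, weights, eps=1e-5):
    """Estimate dC/dw for every weight by nudging one weight at a time.

    Each weight needs its own extra evaluation of cost_fn, so the total
    cost is roughly n forward passes for n weights.
    """
    base = cost_fn(weights)
    grad = np.zeros_like(weights)
    for i in range(weights.size):
        nudged = weights.copy()
        nudged.flat[i] += eps
        grad.flat[i] = (cost_fn(nudged) - base) / eps  # one forward pass per weight
    return grad

# Toy cost C(w) = sum(w**2), whose true gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(lambda w: np.sum(w**2), w))  # ≈ [2., -4., 6.]
```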
How is the chain rule used to reuse intermediate computations?
🎯 Goal of Backpropagation:
Compute the gradient of the cost function w.r.t. every weight and bias:
$$ \frac{\partial C}{\partial w^l_{jk}}, \qquad \frac{\partial C}{\partial b^l_j} $$
To do this, we need to know:
How changing any internal variable (like a weight) affects the final output cost — through a chain of activations and layers.
🔗 Enter the Chain Rule
🔍 Chain Rule in Calculus:
If:
$$ C = f(a), \quad a = g(z), \quad z = h(w) $$
Then:
$$ \frac{dC}{dw} = \frac{dC}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw} $$
This "chain" lets us reuse intermediate derivatives to compute the final gradient.
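As a concrete worked instance (my own illustration, assuming a single sigmoid neuron with quadratic cost):

$$ z = wx + b, \qquad a = \sigma(z), \qquad C = \tfrac{1}{2}(a - y)^2 $$

$$ \frac{\partial C}{\partial w} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = (a - y)\,\sigma'(z)\,x $$

The factor $$ (a - y)\,\sigma'(z) $$ is exactly $$ \partial C/\partial z $$; it is computed once and reused for $$ \partial C/\partial b $$ as well, since $$ \partial z/\partial b = 1 $$.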
🔄 Chain Rule in Neural Networks
Let's focus on computing the gradient w.r.t. a single weight.
Suppose:
- You want $$ \frac{\partial C}{\partial w^l_{jk}} $$
- But cost depends on the output, which depends on activations, which depend on weighted inputs, which depend on the weight.
So apply the chain rule:
$$ \frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \cdot \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \, a^{l-1}_k $$
Each term $$ \delta^l_j $$ comes from the next layer's error. Putting it together:
$$ \delta^l = \left( (W^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) $$
Where:
$$ \delta^l = \frac{\partial C}{\partial z^l} $$ is the error of layer $$ l $$.
👉 This is BP2, the core of backpropagation — chain rule in vector form.
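A minimal numpy sketch of BP2, assuming a sigmoid activation (function and variable names are illustrative, not from the source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def bp2(W_next, delta_next, z_l):
    """BP2: propagate the error one layer back.

    delta_l = (W_{l+1}^T · delta_{l+1}) ⊙ σ'(z_l)
    """
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)

# Toy shapes: layer l has 3 neurons, layer l+1 has 2.
rng = np.random.default_rng(0)
W_next = rng.normal(size=(2, 3))   # weights from layer l to l+1
delta_next = rng.normal(size=2)    # error already known at layer l+1
z_l = rng.normal(size=3)           # weighted inputs stored during the forward pass
print(bp2(W_next, delta_next, z_l))  # error at layer l, shape (3,)
```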
✅ Why It’s Efficient
- You don’t recompute from scratch for every weight.
- Instead, you compute $$ \delta^l $$ once per layer and reuse it, together with the stored activations $$ a^{l-1} $$.
This is why backpropagation is dramatically faster than finite-difference methods: roughly one forward and one backward pass, instead of one forward pass per weight.
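As a rough worked comparison (the numbers are illustrative): for a network with a million weights, finite differences needs about 1,000,001 cost evaluations per gradient (one baseline plus one per weight), i.e. roughly a million forward passes. Backpropagation needs one forward pass plus one backward pass, and the backward pass costs about the same as a forward pass, so the full gradient comes at roughly the price of two forward passes.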
```mermaid
graph LR
    subgraph Forward["Forward Pass (Activations)"]
        X["x"] --> Z1["z1 = W1·x + b1"] --> A1["a1 = σ(z1)"]
        A1 --> Z2["z2 = W2·a1 + b2"] --> A2["a2 = σ(z2)"]
        A2 --> C["Cost C"]
    end
    subgraph Backward["Backward Pass (Gradients)"]
        DA2["dC_da2"] --> DZ2["dC_dz2 = dC_da2 · σ'(z2)"]
        DZ2 --> DW2["dC_dW2 = a1 · delta2"]
        DZ2 --> DB2["dC_db2 = delta2"]
        DZ2 --> DA1["dC_da1 = W2^T · delta2"]
        DA1 --> DZ1["dC_dz1 = dC_da1 · σ'(z1)"]
        DZ1 --> DW1["dC_dW1 = x · delta1"]
        DZ1 --> DB1["dC_db1 = delta1"]
    end
    C --> DA2
    style X fill:#c2f0c2
    style C fill:#ffd580
    style DW1 fill:#f2a3a3
    style DW2 fill:#f2a3a3
    style DB1 fill:#f2a3a3
    style DB2 fill:#f2a3a3
    style DZ1 fill:#d3d3f2
    style DZ2 fill:#d3d3f2
```
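The diagram translates almost line-for-line into numpy. A minimal sketch, assuming sigmoid activations and a quadratic cost (sizes and values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# --- Setup: a tiny 2-layer network (illustrative sizes and values) ---
rng = np.random.default_rng(42)
x = rng.normal(size=3)                       # input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y = np.array([1.0, 0.0])                     # target

# --- Forward pass: store the z's and a's for reuse in the backward pass ---
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
C = 0.5 * np.sum((a2 - y) ** 2)              # quadratic cost (assumed)

# --- Backward pass: mirrors the diagram above ---
dC_da2 = a2 - y                              # C --> dC_da2
delta2 = dC_da2 * sigmoid_prime(z2)          # dC_dz2 = dC_da2 ⊙ σ'(z2)
dC_dW2 = np.outer(delta2, a1)                # dC_dW2 = delta2 (a1)^T
dC_db2 = delta2                              # dC_db2 = delta2
dC_da1 = W2.T @ delta2                       # dC_da1 = W2^T · delta2
delta1 = dC_da1 * sigmoid_prime(z1)          # dC_dz1 = dC_da1 ⊙ σ'(z1)
dC_dW1 = np.outer(delta1, x)                 # dC_dW1 = delta1 (x)^T
dC_db1 = delta1                              # dC_db1 = delta1

print(C, dC_dW1.shape, dC_dW2.shape)         # all gradients from one backward pass
```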
Backpropagation enables us to compute all the partial derivatives $$ \partial C/\partial w_j $$ simultaneously, using just one forward pass followed by one backward pass through the network.
Gradient = Chain of Derivatives
You want:
$$ \frac{\partial C}{\partial w^l_{jk}} $$ for every weight (and similarly for every bias).
You reuse intermediate values (activations, derivatives, errors):
$$ z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j $$
$$ a^l_j = \sigma(z^l_j) $$
$$ \delta^l_j = \frac{\partial C}{\partial z^l_j} $$
So:
$$ \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j $$
This is just an outer product of:
- Activations from layer $$ l-1 $$
- Errors from layer $$ l $$

You compute these layer-wise, not per-weight — giving you all gradients in one shot (see the sketch below).
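A tiny numpy illustration of that outer product, with made-up numbers (assuming layer $$ l $$ has 3 neurons and layer $$ l-1 $$ has 4):

```python
import numpy as np

# Illustrative values: error of layer l (3 neurons), activations of layer l-1 (4 neurons).
delta_l = np.array([0.2, -0.1, 0.05])
a_prev = np.array([0.9, 0.1, 0.4, 0.7])

dW = np.outer(delta_l, a_prev)   # ∂C/∂W^l: entry (j, k) equals a_prev[k] * delta_l[j]
db = delta_l                     # ∂C/∂b^l

print(dW.shape)  # (3, 4): all 12 weight gradients of this layer from one outer product
```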
Does backpropagation involve both a forward and a backward pass, or just a backward pass? What is the forward pass and what is the backward pass?
🔁 Simple Analogy First
Think of your neural network like a factory:
- 🚚 Input (raw material) goes in
- 🛠️ Each layer transforms it
- 🎯 Final layer gives a product (prediction)
Then you ask:
“How good is the product?” If it’s off, you figure out: “Which part of the factory messed up and by how much?”
🧠 Definitions
✅ Forward Pass
> **Compute predictions** from the input using the current weights and biases, producing the activations layer by layer.
For input $$ x $$, at each layer:
$$ z^l = W^l a^{l-1} + b^l $$
$$ a^l = \sigma(z^l) $$
Final output:
$$ \hat{y} = a^L $$
Then the cost:
$$ C = \text{Loss}(\hat{y}, y) $$
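A hedged sketch of the forward pass as code, assuming sigmoid activations and a quadratic cost (all names and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, y):
    """Forward pass: compute and store z^l and a^l for every layer, then the loss."""
    activations, zs = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    loss = 0.5 * np.sum((a - y) ** 2)   # quadratic cost, assumed for illustration
    return activations, zs, loss

# Toy 3-4-2 network with made-up parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
activations, zs, loss = forward(rng.normal(size=3), weights, biases, np.array([1.0, 0.0]))
print(loss, [a.shape for a in activations])
```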
🧠 **Key output** of forward pass:
- Activations $$ a^l $$
- Pre-activations $$ z^l $$
- Loss $$ C $$

---

### **🔁** **Backward Pass (Backpropagation)**

> **Compute gradients** of the loss w.r.t. weights and biases using the **chain rule**.

It starts at the output layer:
$$ \delta^L = \nabla_a C \odot \sigma'(z^L) $$
Then, moving backwards layer by layer:
$$ \delta^l = \left( W^{l+1} \right)^T \delta^{l+1} \odot \sigma'(z^l) $$
$$ \frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T $$
$$ \frac{\partial C}{\partial b^l} = \delta^l $$
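And a matching sketch of the backward pass, under the same assumptions (sigmoid activations, quadratic cost); it consumes the $$ a^l $$ and $$ z^l $$ cached by the forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(weights, activations, zs, y):
    """Backward pass: start from δ^L at the output, then walk the layers in reverse.

    Returns ∂C/∂W^l and ∂C/∂b^l for every layer, reusing the cached a^l and z^l.
    Assumes a quadratic cost, so ∇_a C = a^L - y.
    """
    L = len(weights)
    grads_W = [None] * L
    grads_b = [None] * L
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])             # δ^L
    for l in reversed(range(L)):
        grads_W[l] = np.outer(delta, activations[l])                  # ∂C/∂W^l = δ^l (a^{l-1})^T
        grads_b[l] = delta                                            # ∂C/∂b^l = δ^l
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1]) # BP2
    return grads_W, grads_b

# Recompute the forward-pass caches here so this sketch runs standalone.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x, y = rng.normal(size=3), np.array([1.0, 0.0])
activations, zs = [x], []
a = x
for W, b in zip(weights, biases):
    z = W @ a + b; a = sigmoid(z)
    zs.append(z); activations.append(a)

grads_W, grads_b = backward(weights, activations, zs, y)
print([g.shape for g in grads_W])  # [(4, 3), (2, 4)]: gradients for every layer
```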
🧠 **Key output** of backward pass:
- All gradients needed to update parameters

|**Stage**|**What Happens**|**Used For**|
|---|---|---|
|Forward Pass|Compute $$ a^l, z^l $$ and final loss $$ C $$|Model prediction|
|Backward Pass|Compute $$ \delta^l $$ and $$ \partial C/\partial W, b $$|Weight updates|
|Gradient Descent|Use gradients to update weights|Learning|
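For completeness, the last row of the table as a sketch: once backprop has produced the gradients, gradient descent nudges every parameter against its gradient (the learning rate and the arrays here are placeholders):

```python
import numpy as np

eta = 0.5  # learning rate (illustrative value)

# Pretend these arrays came out of the backward pass above.
weights = [np.ones((4, 3)), np.ones((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
grads_W = [0.1 * np.ones((4, 3)), 0.1 * np.ones((2, 4))]
grads_b = [0.1 * np.ones(4), 0.1 * np.ones(2)]

# Gradient descent step: move each parameter a little downhill.
weights = [W - eta * dW for W, dW in zip(weights, grads_W)]
biases = [b - eta * db for b, db in zip(biases, grads_b)]
```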