💡 Mathematical Insight

Using the chain rule and vectorized matrix operations, backpropagation computes all of the gradients efficiently:

  • First, the gradient of the cost with respect to the output layer is computed.

  • Then the errors are propagated backwards, layer by layer, using matrix operations:

    • δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)

  • The weight gradients then follow immediately (sketched in code below):

    • ∂C/∂W^l = δ^l (a^{l-1})^T
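A minimal NumPy sketch of those two matrix operations, assuming column-vector activations and sigmoid units (the layer sizes and variable names here are illustrative, not part of the original derivation):

```python
import numpy as np

def sigmoid_prime(z):
    """σ′(z) for the sigmoid activation: σ(z) * (1 - σ(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Toy layer sizes: layer l-1, layer l, layer l+1.
rng = np.random.default_rng(0)
n_prev, n_l, n_next = 5, 4, 3

W_next = rng.normal(size=(n_next, n_l))    # W^{l+1}
delta_next = rng.normal(size=(n_next, 1))  # δ^{l+1}
z_l = rng.normal(size=(n_l, 1))            # z^l
a_prev = rng.normal(size=(n_prev, 1))      # a^{l-1}

# δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l): one transpose-matvec plus a Hadamard product
delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)

# ∂C/∂W^l = δ^l (a^{l-1})^T: an outer product with shape (n_l, n_prev)
grad_W_l = delta_l @ a_prev.T
```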

Why do you think the errors (δ) are defined in terms of ∂C/∂z rather than ∂C/∂a (activations)?

Think from a chain rule and gradient flow perspective. How does that simplify computations?
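One way to see it (a chain-rule sketch, not the only framing): z^l affects the cost only through a^l = σ(z^l), so

    δ^l = ∂C/∂z^l = ∂C/∂a^l ⊙ σ′(z^l)

and because z^l = W^l a^{l-1} + b^l is linear in the parameters, both remaining gradients drop straight out of the stored δ^l:

    ∂C/∂W^l = δ^l (a^{l-1})^T    and    ∂C/∂b^l = δ^l

so caching one error vector per layer is enough to recover every parameter gradient without extra derivative passes.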

🎯 Goal

Minimize the cost function by adjusting weights and biases using gradient descent.


🧪 Given

  • Training examples: (x, y)
  • Learning rate: η
  • Cost function: C
  • Activation function: σ
  • Derivative: σ′
  • Total layers: L
  • Mini-batch size: m

🧮 Step-by-Step (Per Training Example x)

1. 🚀 Feedforward

For each layer l = 2, 3, …, L (with a^1 = x as the input), compute the weighted input and activation:

    z^l = W^l a^{l-1} + b^l,    a^l = σ(z^l)
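A small feedforward sketch under the same assumptions (sigmoid activations, column vectors; `weights` and `biases` are illustrative per-layer parameter lists, not names fixed by the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """Return the per-layer weighted inputs z^l and activations a^l for one example x."""
    a = x
    zs, activations = [], [x]          # a^1 = x (input layer)
    for W, b in zip(weights, biases):  # layers l = 2, ..., L
        z = W @ a + b                  # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)                 # a^l = σ(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Example usage with a tiny 3-4-2 network:
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=(4, 1)), rng.normal(size=(2, 1))]
zs, activations = feedforward(rng.normal(size=(3, 1)), weights, biases)
```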


2. 🎯 Output Error (Final Layer)

For the quadratic cost C = ½‖y − a^L‖², the gradient with respect to the output activations is:

    ∇_a C = (a^L − y)

So the output error is:

    δ^L = (a^L − y) ⊙ σ′(z^L)
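Continuing the feedforward sketch above (so `zs`, `activations`, and `sigmoid_prime` come from the earlier snippets, and `y` is the target for this example), the output error is one line:

```python
# δ^L = (a^L - y) ⊙ σ′(z^L)
delta_L = (activations[-1] - y) * sigmoid_prime(zs[-1])
```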


3. 🔁 Backpropagate Error

For each layer l = L−1, L−2, …, 2, push the error one layer back:

    δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)


4. 🧷 Gradient Descent Update

For each layer l = L, L−1, …, 2, average the gradients over the mini-batch of size m and step downhill:

  • Weights: W^l → W^l − (η/m) Σ_x δ^{x,l} (a^{x,l−1})^T
  • Biases: b^l → b^l − (η/m) Σ_x δ^{x,l}
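A sketch of that update, assuming `grad_W[l]` and `grad_b[l]` already hold the per-layer gradients summed over a mini-batch of size m, and reusing the `weights`/`biases` lists from the feedforward sketch (all names illustrative):

```python
eta, m = 0.5, 10                          # learning rate η and mini-batch size m (example values)
for l in range(len(weights)):
    weights[l] -= (eta / m) * grad_W[l]   # W^l ← W^l − (η/m) Σ_x δ^{x,l} (a^{x,l−1})^T
    biases[l]  -= (eta / m) * grad_b[l]   # b^l ← b^l − (η/m) Σ_x δ^{x,l}
```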

What are the 4 fundamental equations?

(BP1) Error at the output layer

    δ^L = ∇_a C ⊙ σ′(z^L)

  • δ^L: error vector at the output layer
  • ∇_a C: how the cost changes w.r.t. the output activations
  • σ′(z^L): slope of the activation at the weighted input
  • Meaning: error = "how wrong the output was" × "how sensitive the neuron is"

(BP2) Error at hidden layers

    δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)

  • Pushes the error backward through the weights
  • Multiplies by the local derivative σ′(z^l)
  • Meaning: hidden neurons inherit error signals from later layers, scaled by their influence

(BP3) Gradient w.r.t. biases

    ∂C/∂b^l_j = δ^l_j

  • Each bias learns directly from its neuron's error
  • Meaning: the bias gradient is simply the error itself

(BP4) Gradient w.r.t. weights

    ∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j

  • Weight gradient = input activation × output error
  • Meaning: a connection strengthens or weakens in proportion to how active its input was and how wrong its output turned out
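For instance (toy numbers, purely illustrative): if the input activation is a^{l−1}_k = 0.8 and the error is δ^l_j = −0.1, then ∂C/∂w^l_{jk} = 0.8 × (−0.1) = −0.08, so gradient descent nudges that weight upward; if the input had been silent (a^{l−1}_k = 0), the weight would not move at all.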

🔗 Putting it together

  1. Compute output error (BP1).
  2. Propagate error backward (BP2).
  3. Use deltas to compute gradients for biases (BP3) and weights (BP4).
  4. Update parameters with gradient descent.
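
Putting BP1–BP4 into a single backward pass, here is a minimal sketch for one training example, reusing `feedforward` and `sigmoid_prime` from the earlier snippets and assuming the quadratic cost (an illustrative reconstruction, not a reference implementation):

```python
def backprop(x, y, weights, biases):
    """Return (grad_W, grad_b): gradients of the quadratic cost for one example (x, y)."""
    zs, activations = feedforward(x, weights, biases)

    # BP1: output error
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_W[-1] = delta @ activations[-2].T        # BP4 at the output layer
    grad_b[-1] = delta                            # BP3 at the output layer

    # BP2: propagate the error backward through the hidden layers
    for l in range(2, len(weights) + 1):          # l counts layers from the back
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_W[-l] = delta @ activations[-l - 1].T  # BP4
        grad_b[-l] = delta                          # BP3
    return grad_W, grad_b
```

Summing these per-example gradients over a mini-batch and plugging them into the step-4 update rule completes one round of learning.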