MLP Learning Process: The Math Behind Deep Learning
AI, But Simple Issue #18
Math of MLPs: Backpropagation and Learning
Last week, we discussed the forward pass of a Multi-layer Perceptron (MLP). We went through pre-activation, activation, and layer-to-layer matrix math using the iris dataset.
If you want to read it, you can find it here.
This week’s letter is a continuation of the MLP learning process, namely the backward pass and parameter updates.
We’ll continue to use the iris dataset for the examples.
To summarize, the entire training process of the MLP goes like this (a compact code sketch of the full cycle follows the list):

1. Randomly initialize weights and biases
2. Forward propagation
3. Compute the loss
4. Backward propagation: compute the gradients of the loss with respect to the weights and biases
5. Update the weights and biases: use an optimization algorithm like gradient descent
6. Repeat
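To see how these six steps fit together before we dig into the math, here is a compact, runnable NumPy sketch of the whole cycle on the single iris sample. The sigmoid hidden activation, softmax output, cross-entropy loss, learning rate, iteration count, and random initialization are all illustrative assumptions, not values from the newsletter:

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: randomly initialize weights and biases for the 4-6-3 network
W1 = rng.standard_normal((6, 4)) * 0.1
b1 = np.zeros((6, 1))
W2 = rng.standard_normal((3, 6)) * 0.1
b2 = np.zeros((3, 1))

# The single iris sample and its one-hot label (Iris Setosa, class 0)
x = np.array([[5.1], [3.5], [1.4], [0.2]])
y = np.array([[1.0], [0.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.1
for step in range(100):
    # Step 2: forward propagation
    z1 = W1 @ x + b1            # hidden pre-activation, shape (6, 1)
    a1 = sigmoid(z1)            # hidden activation,     shape (6, 1)
    z2 = W2 @ a1 + b2           # output pre-activation, shape (3, 1)
    a2 = softmax(z2)            # predicted probabilities, shape (3, 1)

    # Step 3: compute the loss (cross-entropy against the one-hot label)
    loss = -float(np.sum(y * np.log(a2)))

    # Step 4: backward propagation -- gradients of the loss
    delta2 = a2 - y                            # output-layer error term
    dW2, db2 = delta2 @ a1.T, delta2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # hidden-layer error term
    dW1, db1 = delta1 @ x.T, delta1

    # Step 5: update weights and biases with gradient descent
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    # Step 6: repeat (the loop runs the cycle again)

print("final loss:", loss)
```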
In this issue, we’ll focus on backpropagation and parameter updates in more detail.
But before we get into the gradient calculations, let’s review some basic multivariable calculus:
A partial derivative (∂f/∂x) is the derivative of a function f with respect to one of its variables, x, treating the other variables as constants.
For example, if f(x, y) = x² + y², then ∂f/∂x = 2x, because y² is simply a constant with respect to x.
Make sure you understand how partial derivatives work and how the chain rule for multivariable functions works before proceeding.
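As a quick reference, here is the multivariable chain rule in the two-variable case; this is the pattern we apply over and over during backpropagation:

```latex
% Chain rule for a function of two intermediate variables:
% if z = f(u, v) with u = u(x) and v = v(x), the derivative of z with
% respect to x adds up the contribution of every path from x to z.
\[
\frac{dz}{dx}
  = \frac{\partial f}{\partial u}\,\frac{du}{dx}
  + \frac{\partial f}{\partial v}\,\frac{dv}{dx}
\]
```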
Let’s use the same sample we used last time:
Sepal Length (x1): 5.1 cm
Sepal Width (x2): 3.5 cm
Petal Length (x3): 1.4 cm
Petal Width (x4): 0.2 cm
This time, we’ll include a label: Iris Setosa (Class 0)
The true label vector y is one-hot encoded as y = [1, 0, 0]ᵀ, with a 1 in the position of the true class (Iris Setosa, Class 0) and 0s everywhere else.
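If you are following along in code, a minimal sketch of this encoding looks like the following (NumPy is an assumption here; the newsletter itself works through the math by hand):

```python
import numpy as np

num_classes = 3
true_class = 0                    # Iris Setosa is class 0

# One-hot encoding: a 1 at the index of the true class, 0s elsewhere
y = np.zeros((num_classes, 1))
y[true_class] = 1.0

print(y.ravel())                  # [1. 0. 0.]
```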
We’ll use the same randomly initialized weight matrices and bias vectors as last week. For the hidden layer, W(1) is size 6×4 and b(1) is size 6×1; for the output layer, W(2) is size 3×6 and b(2) is size 3×1.
The model will remain the same, with 4 neurons in the input layer (one per feature), 6 neurons in the hidden layer, and 3 neurons in the output layer (one per class).
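To double-check the bookkeeping, the short sketch below instantiates parameters with these shapes and confirms that the dimensions line up for a single sample (the random values are placeholders, not the actual matrices from last week's figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimension check for the 4-6-3 architecture
W1, b1 = rng.standard_normal((6, 4)), np.zeros((6, 1))   # hidden layer
W2, b2 = rng.standard_normal((3, 6)), np.zeros((3, 1))   # output layer

x = np.array([[5.1], [3.5], [1.4], [0.2]])               # input, shape (4, 1)
print((W1 @ x + b1).shape)               # (6, 1): one value per hidden neuron
print((W2 @ (W1 @ x + b1) + b2).shape)   # (3, 1): one value per output class
```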
Continuing from last week’s example, let’s assume the model comes up with this output vector:
Error Term for Output Layer
We start off by computing the error at the output layer, noted as δ(2). Delta denotes an error term, and the superscript indicates the layer it belongs to (layer 2, the output layer), which is where we begin since backpropagation works backwards from the output.
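Before plugging in numbers, it helps to see the general formula. The output-layer error multiplies the gradient of the loss with respect to the output activations by the derivative of the output activation function; if the network uses a softmax output with cross-entropy loss (an assumption here, and a very common pairing), this product collapses to the difference between the prediction and the one-hot label:

```latex
% General form: elementwise (Hadamard) product of the loss gradient
% and the activation derivative at the output pre-activation z^{(2)}.
\[
\delta^{(2)} = \nabla_{a^{(2)}} L \;\odot\; f'\big(z^{(2)}\big)
\]
% Assuming a softmax output layer trained with cross-entropy loss:
\[
\delta^{(2)} = \hat{y} - y
\]
```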