MLP Learning Process: The Math Behind Deep Learning
AI, But Simple Issue #18
Math of MLPs: Backpropagation and Learning
Last week, we discussed the forward pass of a Multi-layer Perceptron (MLP). We went through pre-activation, activation, and layer-to-layer matrix math using the iris dataset.
If you want to read it, you can find it here.
This week’s letter is a continuation of the MLP learning process, namely the backward pass and parameter updates.
We’ll continue to use the iris dataset for the examples.
To summarize, the entire training process of the MLP goes like this (a compact code sketch of the full cycle follows the list):

1. Randomly initialize weights and biases
2. Forward propagation
3. Compute the loss
4. Backward propagation: compute the gradients of the loss with respect to the weights and biases
5. Update the weights and biases: use an optimization algorithm like gradient descent
6. Repeat
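To see how these six steps fit together before we dig into the math, here is a compact, runnable NumPy sketch of the whole cycle on the single iris sample. The sigmoid hidden activation, softmax output, cross-entropy loss, learning rate, iteration count, and random initialization are all illustrative assumptions, not values from the newsletter:

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: randomly initialize weights and biases for the 4-6-3 network
W1 = rng.standard_normal((6, 4)) * 0.1
b1 = np.zeros((6, 1))
W2 = rng.standard_normal((3, 6)) * 0.1
b2 = np.zeros((3, 1))

# The single iris sample and its one-hot label (Iris Setosa, class 0)
x = np.array([[5.1], [3.5], [1.4], [0.2]])
y = np.array([[1.0], [0.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.1
for step in range(100):
    # Step 2: forward propagation
    z1 = W1 @ x + b1            # hidden pre-activation, shape (6, 1)
    a1 = sigmoid(z1)            # hidden activation,     shape (6, 1)
    z2 = W2 @ a1 + b2           # output pre-activation, shape (3, 1)
    a2 = softmax(z2)            # predicted probabilities, shape (3, 1)

    # Step 3: compute the loss (cross-entropy against the one-hot label)
    loss = -float(np.sum(y * np.log(a2)))

    # Step 4: backward propagation -- gradients of the loss
    delta2 = a2 - y                            # output-layer error term
    dW2, db2 = delta2 @ a1.T, delta2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # hidden-layer error term
    dW1, db1 = delta1 @ x.T, delta1

    # Step 5: update weights and biases with gradient descent
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    # Step 6: repeat (the loop runs the cycle again)

print("final loss:", loss)
```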
In this issue, we’ll focus on backpropagation and parameter updates in more detail.
But before we get into the gradient calculations, let’s review some basic multivariable calculus:
A partial derivative (∂f/∂x) is the derivative of a function f with respect to one of its variables, x, treating the other variables as constants.
For example, if f(x, y) = x² + y², then ∂f/∂x = 2x, because y² is simply a constant with respect to x.
Make sure you understand how partial derivatives work and how the chain rule for multivariable functions works before proceeding.
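As a quick reference, here is the multivariable chain rule in the two-variable case; this is the pattern we apply over and over during backpropagation:

```latex
% Chain rule for a function of two intermediate variables:
% if z = f(u, v) with u = u(x) and v = v(x), the derivative of z with
% respect to x adds up the contribution of every path from x to z.
\[
\frac{dz}{dx}
  = \frac{\partial f}{\partial u}\,\frac{du}{dx}
  + \frac{\partial f}{\partial v}\,\frac{dv}{dx}
\]
```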
Let’s use the same sample we used last time:
Sepal Length (x1): 5.1 cm
Sepal Width (x2): 3.5 cm
Petal Length (x3): 1.4 cm
Petal Width (x4): 0.2 cm
This time, we’ll include a label: Iris Setosa (Class 0)
The true label vector y is one-hot encoded as y = [1, 0, 0]ᵀ, with a 1 in the position of the true class (Iris Setosa, Class 0) and 0s everywhere else.
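If you are following along in code, a minimal sketch of this encoding looks like the following (NumPy is an assumption here; the newsletter itself works through the math by hand):

```python
import numpy as np

num_classes = 3
true_class = 0                    # Iris Setosa is class 0

# One-hot encoding: a 1 at the index of the true class, 0s elsewhere
y = np.zeros((num_classes, 1))
y[true_class] = 1.0

print(y.ravel())                  # [1. 0. 0.]
```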
We’ll use the same randomly initialized weight matrices and bias vectors as last week. For the hidden layer, W(1) is size 6×4 and b(1) is size 6×1; for the output layer, W(2) is size 3×6 and b(2) is size 3×1.
The model will remain the same, with 4 neurons in the input layer (one per feature), 6 neurons in the hidden layer, and 3 neurons in the output layer (one per class).
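To double-check the bookkeeping, the short sketch below instantiates parameters with these shapes and confirms that the dimensions line up for a single sample (the random values are placeholders, not the actual matrices from last week's figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimension check for the 4-6-3 architecture
W1, b1 = rng.standard_normal((6, 4)), np.zeros((6, 1))   # hidden layer
W2, b2 = rng.standard_normal((3, 6)), np.zeros((3, 1))   # output layer

x = np.array([[5.1], [3.5], [1.4], [0.2]])               # input, shape (4, 1)
print((W1 @ x + b1).shape)               # (6, 1): one value per hidden neuron
print((W2 @ (W1 @ x + b1) + b2).shape)   # (3, 1): one value per output class
```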
Continuing from last week’s example, let’s assume the model comes up with this output vector:
Error Term for Output Layer
We start off by computing the error at the output layer, noted as δ(2). Delta denotes an error term, and the superscript indicates the layer it belongs to (layer 2, the output layer), which is where we begin since backpropagation works backwards from the output.
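Before plugging in numbers, it helps to see the general formula. The output-layer error multiplies the gradient of the loss with respect to the output activations by the derivative of the output activation function; if the network uses a softmax output with cross-entropy loss (an assumption here, and a very common pairing), this product collapses to the difference between the prediction and the one-hot label:

```latex
% General form: elementwise (Hadamard) product of the loss gradient
% and the activation derivative at the output pre-activation z^{(2)}.
\[
\delta^{(2)} = \nabla_{a^{(2)}} L \;\odot\; f'\big(z^{(2)}\big)
\]
% Assuming a softmax output layer trained with cross-entropy loss:
\[
\delta^{(2)} = \hat{y} - y
\]
```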