Optimizers in Deep Learning: A Simple Explanation

AI, But Simple Issue #5

In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training.

  • An optimizer is simply an optimization algorithm (like gradient descent); different optimizers use different strategies to converge toward optimal parameter values and improve the model’s predictions

Its primary role is to minimize the model’s error or loss function, boosting its accuracy and making it more useful in real-world applications.

But before talking more about optimizers, there are a few terms that you should be familiar with:

  • Epoch—A unit of training time: one complete pass through the entire training dataset

  • Batch—Subset of the training data

    • Batch size is the number of samples processed per parameter update

  • Learning rate—Hyperparameter that controls how much the model’s weights are adjusted at each update (a quick numerical example follows this list).
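
Here is a short, purely illustrative sketch of how these three terms fit together in a training loop; the numbers below are made up for this example, not taken from this issue:

```python
# Hypothetical numbers, chosen only to illustrate the terminology above
dataset_size = 10_000     # total training samples
batch_size = 100          # samples used per parameter update
epochs = 5                # full passes through the dataset

updates_per_epoch = dataset_size // batch_size   # 100 updates per epoch
total_updates = updates_per_epoch * epochs       # 500 parameter updates overall

learning_rate = 0.01  # scales how far the weights move on each of those updates
print(updates_per_epoch, total_updates)          # 100 500
```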

Also, if you don’t know what weights and biases are, or what a cost/loss function is, feel free to check out this previous issue for an explanation.

In this issue, we won’t go too deep into the mathematical intuition of optimizers, so if you want to understand more about the inner workings of these optimizers, I suggest you check out some other resources.

Let’s start off by talking about one of the most popular optimizers: gradient descent.

Gradient Descent

This optimization algorithm uses calculus (derivatives of the loss) to adjust the parameters so that the loss converges toward a local minimum.

Here are the basics:

  • It takes the gradient of the cost function with respect to the parameters (weights and biases)

  • It finds the new weights by subtracting the gradient multiplied by the learning rate (a small code sketch of this update rule is shown below)
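
Below is a minimal sketch of this update rule in plain Python/NumPy, fitting a single weight to tiny made-up data; the data and hyperparameters here are illustrative, not from this issue:

```python
import numpy as np

# Toy data following y = 2x; we want gradient descent to learn w close to 2
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0              # initial weight
learning_rate = 0.01

for epoch in range(100):
    predictions = w * X
    # Gradient of the mean squared error with respect to w
    grad = np.mean(2 * (predictions - y) * X)
    # Core update: new weight = old weight - learning rate * gradient
    w = w - learning_rate * grad

print(w)  # approaches 2.0
```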

If you want a deeper dive into gradient descent, check out our previous issue about learning parameters (use the find tool and search “gradient descent”).

But as good as gradient descent seems, there can be some issues when using this optimizer.

  • For instance, it can get very computationally expensive to compute the gradients if the data is very large

Also, gradient descent works well for convex functions, but when it comes to non-convex and more complicated functions, it may not behave the way we want.

For instance, it can get stuck at a local minimum or at saddle points, and end up not finding its way down to the absolute minimum.
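
As a small, purely illustrative demonstration (the function below is made up for this sketch), gradient descent started on the wrong side of a non-convex function settles into the shallower of its two minima and never leaves:

```python
# f has two minima: a global one near x = -1.3 and a local one near x = 1.1
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1   # derivative of f

x = 2.0              # start in the basin of the shallower (local) minimum
learning_rate = 0.01

for step in range(1000):
    x = x - learning_rate * grad_f(x)

print(x)  # ends up near 1.1 (the local minimum), never reaching the global minimum near -1.3
```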

Non-convex functions pose a large problem to gradient based optimization algorithms like gradient descent, and new optimizers tackle this problem in various ways.

One simple way is to add some randomness and noise to the convergence, like in Stochastic Gradient Descent:

Stochastic Gradient Descent (SGD)

To tackle the challenges large datasets and non-convex functions pose to gradient descent, we can use stochastic gradient descent.

  • The word stochastic represents the element of randomness used in the algorithm.

In stochastic gradient descent, instead of processing the entire dataset during each iteration, we randomly select a single sample of data (typically after shuffling the dataset) and process it to update our parameters.

This means that only one sample from the dataset is considered at a time, allowing for more efficient and computationally feasible optimization.
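
Here is a minimal sketch of that idea, reusing the same toy one-weight model as before; again, the data and settings are illustrative only:

```python
import numpy as np

# Toy data following y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0
learning_rate = 0.01
rng = np.random.default_rng(0)

for epoch in range(100):
    # Visit the samples in a random order each epoch
    for i in rng.permutation(len(X)):
        prediction = w * X[i]
        # Gradient of the squared error for this single sample only
        grad = 2 * (prediction - y[i]) * X[i]
        w = w - learning_rate * grad   # noisy, per-sample update

print(w)  # approaches 2.0, but along a noisier path than full-batch gradient descent
```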

However, since we are not using the whole dataset but only individual samples (or small batches) of it for each iteration, the path taken by the algorithm is much noisier than the one taken by the gradient descent algorithm.

  • This noise can hinder its ability to converge quickly, but it can be useful in some cases

  • For instance, this noise can help the algorithm escape some local minima and valleys (a problem for gradient descent)

Because of this noise, SGD needs a higher number of iterations to reach a local minimum.

But if the dataset is huge, stochastic gradient descent should be preferred over batch gradient descent (also known simply as gradient descent), since each update is so much cheaper that it may converge faster overall, even with the increased number of iterations.

Momentum

Stochastic gradient descent takes a much noisier path, so it requires a significantly greater number of iterations to reach the optimal minimum, and computation time can end up being quite slow.
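
One standard way to speed this up, in the classic textbook formulation (which may differ in detail from the exact version described in this issue), is to accumulate a running “velocity” from past gradients, so that updates keep moving in a consistent direction and the per-sample noise partially cancels out. A minimal sketch on the same toy model as before:

```python
import numpy as np

# Same toy data as before (y = 2x); values are illustrative only
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0
velocity = 0.0
learning_rate = 0.01
beta = 0.9             # momentum coefficient: how much past gradients are remembered
rng = np.random.default_rng(0)

for epoch in range(100):
    for i in rng.permutation(len(X)):
        grad = 2 * (w * X[i] - y[i]) * X[i]
        # Blend the new gradient into a running velocity; the velocity, not the
        # raw gradient, is what moves the weight, which damps the SGD noise
        velocity = beta * velocity - learning_rate * grad
        w = w + velocity

print(w)  # approaches 2.0
```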
