(Supporter Only) Efficient Optimization Algorithms: Mathematically Explained
AI, But Simple Issue #25
In deep learning, an optimizer (or optimization algorithm) is a crucial element that fine-tunes a neural network’s parameters (weights and biases) during training.
Last week, we went over some simple optimization algorithms that were easy to use and understand. It is highly recommended that you read that issue first; it can be found at the link below:
Before continuing, it is important to have a good understanding of neural networks and the mathematics behind them (which can be found in the previous issue).
Make sure you understand basic multivariable calculus and linear algebra, and have a solid understanding of gradient descent.
In practice, the optimization methods mentioned in last week's issue do not hold up well against more modern ones in terms of efficiency and effectiveness.
Problems such as awkwardly shaped parameter spaces become very apparent with many of the simple gradient-descent-based optimization algorithms.
For instance, in Stochastic Gradient Descent (SGD) or Mini-batch Gradient Descent (Mini-batch GD), the optimizer will oscillate a lot, leading to a longer convergence time and the risk of getting stuck in saddle points or other flat areas.
Although updating the parameters in mini-batches or at every data point reduces the risk of getting stuck at these points, it does not fully prevent the problem, so there is still room for improvement.
To combat these problems, one of the first things researchers decided to do was to introduce something called momentum.
Momentum
Momentum is an extension of gradient descent that helps smooth out the updates and speeds up convergence. It does this by taking the past gradients into account when updating the parameters.
It also helps the optimizer escape saddle points (where the gradient is zero but the point is neither a minimum nor a maximum) and flat areas like plateaus, where we would otherwise oscillate back and forth with little progress.
In traditional gradient descent, we update the weights by moving in the direction of the negative gradient.
In momentum, however, we maintain a “velocity” term that accumulates past gradients and influences the current update, which speeds up the progress in the direction of the consistent gradients.
We store the direction of the previous gradients in the velocity term. Formally speaking, the velocity term is the exponentially weighted average of the gradient of previous steps.
Here’s an example of momentum at work in gradient descent:
The red arrows show the direction taken by one step of gradient descent with momentum.
The blue dotted lines show the direction of the gradient (of the current minibatch) on each step.
Instead of just following the gradient, we let the velocity influence the direction of the step.
Mathematically, the velocity term looks something like this for batch gradient descent:
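In one common formulation (the exponentially weighted average form; some references omit the 1 - β factor and simply add the raw gradient):

v_t = β · v_{t-1} + (1 - β) · ∇_W L(W_{t-1})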
Here, t represents a timestep (or think of t-1 as the old and t as the new), vt is the velocity term, β is the momentum coefficient (tunable hyperparameter, often around 0.9), and ∇WL(Wt-1) is the gradient of the loss (think partial derivative) with respect to Wt-1.
We can add this velocity term onto gradient descent (batch gradient descent) to obtain this gradient descent and momentum formula:
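Using the symbols defined just below, the update is:

W_t = W_{t-1} - α · v_t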
Here, α is the learning rate (tunable hyperparameter), vt is the velocity term that was previously mentioned, Wt-1 is the old weight matrix, and Wt is the new (updated) weight matrix.
Just to be clear, if the parameters form a weight matrix or vector, the learning rate (α) scales vt element-wise (every entry of vt is multiplied by α).
Just to make the notation consistent across the velocity and weights, we’ve opted to include t and t-1 instead of t and t+1.
We’re also using a new notation for the gradient in the velocity formula: the nabla (∇) notation.
Just a quick reminder: the gradient (written ∇xL) is the collection of all the partial derivatives of L with respect to each component (or entry) of a variable x (which can be a matrix, vector, etc.).
It is the same thing as the partial derivative of the loss with respect to a weight matrix or a bias vector. It’s simply a different notation.
Back to the idea of using momentum to reduce oscillations: although we can use velocity with Batch Gradient Descent, it’s more practical to apply momentum to Stochastic Gradient Descent or Mini-batch Gradient Descent, since those methods see more oscillations, while batch gradient descent is less prone to them.
So, in most cases, our velocity term using SGD or Mini-batch would look something like this:
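For SGD (shown here; the mini-batch version simply replaces this gradient with the average over the batch), one common form is:

v_t = β · v_{t-1} + (1 - β) · ∇_W L(W_{t-1}; x^{(i)}, y^{(i)})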
The only difference between SGD with momentum and Batch GD with momentum is that ∇WL(Wt−1;x(i),y(i)) is the gradient of the loss function L with respect to W for the randomly selected data point (x(i),y(i)).
The same principle applies to Mini-batch GD with momentum: the average of the gradients over the points in the mini-batch is used instead of the gradient over the whole dataset.
The parameter update rule is the same for all three: the new weights are obtained from the old weights and the velocity term exactly as shown above.
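To make the momentum update concrete, here is a minimal NumPy sketch of SGD with momentum; the toy quadratic loss, function name, and hyperparameter values are illustrative, not from this issue:

```python
import numpy as np

def sgd_momentum_step(W, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update step."""
    v = beta * v + (1 - beta) * grad  # velocity: exponentially weighted average of gradients
    W = W - lr * v                    # move the weights along the negative velocity
    return W, v

# Toy usage: minimize L(W) = ||W||^2, whose gradient is 2W.
W = np.array([5.0, -3.0])
v = np.zeros_like(W)
for step in range(500):
    grad = 2 * W  # in practice, this comes from a random sample or mini-batch
    W, v = sgd_momentum_step(W, v, grad)
print(W)  # approaches [0, 0]
```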
Although gradient descent with momentum improves on plain batch gradient descent, it still isn’t the most efficient option. Furthermore, the momentum coefficient, like the learning rate, requires tuning.
Researchers therefore looked into changing the learning rate over the course of training to improve convergence. This is what led to the creation of Adagrad.
Adaptive Gradient Optimization (Adagrad)
Adagrad is an optimization algorithm that adapts the learning rate for parameters over time (like weight matrices or bias vectors) based on the past gradients of that parameter.
It does this by dividing the learning rate by another term based on the past gradients.
Parameters with larger gradients get smaller learning rates over time, while parameters with smaller gradients keep relatively larger learning rates.
Pay close attention to the size of Adagrad’s steps to see the adaptive learning rate in action.
Adapting the learning rates this way works well for sparse data (where many feature values are 0 or missing) or when certain features are more informative than others.
Adagrad can be used with the entire dataset (full-batch), mini-batches, or data points just like SGD and Mini-Batch GD.
Full-batch Adagrad
In full-batch Adagrad, the gradient calculation uses the entire dataset for each update.
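In one common form (squares, square roots, and divisions are element-wise; the exact placement of ϵ varies slightly between references):

s_t = s_{t-1} + (∇_W L(W_{t-1}))²
W_t = W_{t-1} - (α / (√s_t + ϵ)) · ∇_W L(W_{t-1})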
Above, st is the cumulative sum of squared gradients at time step t, ∇WL(Wt-1) is the gradient of the loss with respect to a parameter Wt-1, α is the learning rate, and ϵ is a small constant (often 10⁻⁸) to prevent division by zero.
Instead of using an exponential moving average like momentum, Adagrad accumulates the square of each gradient over time to influence the learning rate.
If the previous gradients are large, then by the formula for st, st is large, and the effective learning rate shrinks because the learning rate is divided by a larger value.
In the opposite sense, if the previous gradients are small, st stays smaller, so the learning rate is divided by a smaller value and remains relatively large.
Mini-batch Adagrad
When used with mini-batches, Adagrad updates st based on the gradient calculated over each mini-batch.
The sum of squared gradients over the mini-batch can be more stable than a purely stochastic approach (updating based on a single sample), but less stable than full-batch.
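One way to write this, with the mini-batch gradient averaged first (matching the description below):

g_t = (1/B) · Σ_{i=1}^{B} ∇_W L(W_{t-1}; x^{(i)}, y^{(i)})
s_t = s_{t-1} + g_t²
W_t = W_{t-1} - (α / (√s_t + ϵ)) · g_t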
Here, ∇WL(Wt-1;x(i),y(i)) is the gradient of loss with respect to Wt-1 for each point (x(i),y(i)) in the mini-batch, where B is the batch size. In Wt’s calculation, the gradients are summed and divided by the batch size.
Stochastic Adagrad
In stochastic Adagrad, st is updated based on individual samples. This setup can make the learning rate change quickly (highly adaptive). This introduces a lot of noise, which may lead to less stable convergence compared to mini-batch or full-batch Adagrad.
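Written out for a single randomly selected point, one common form is:

s_t = s_{t-1} + (∇_W L(W_{t-1}; x^{(i)}, y^{(i)}))²
W_t = W_{t-1} - (α / (√s_t + ϵ)) · ∇_W L(W_{t-1}; x^{(i)}, y^{(i)})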
Here, ∇WL(Wt-1;x(i),y(i)) is the gradient of loss with respect to Wt-1 for each point (x(i),y(i)).
Adagrad improves on the previous optimization algorithms in the sense that it adapts the learning rate during training to converge better.
However, while Adagrad works well in some situations, it can make the learning rate too small over time, which may slow down convergence.
This led to the development of RMSprop.
Root Mean Square Propagation (RMSprop)
RMSprop took the idea of momentum’s exponential moving average and applied it to the squared gradients to address Adagrad’s shrinking learning rates.
RMSprop works with mini-batches and the entire dataset just like Adagrad, so we’ll skip those equations since they would be redundant.
The general update rule for RMSprop looks something like this:
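In its usual form (again with element-wise operations):

s_t = β · s_{t-1} + (1 - β) · (∇_W L(W_{t-1}))²
W_t = W_{t-1} - (α / (√s_t + ϵ)) · ∇_W L(W_{t-1})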
Here, st is the moving average of squared gradients, ϵ is a small constant to prevent division by zero, ∇WL(Wt-1) is the gradient of the loss with respect to a parameter Wt-1, and β is the decay rate (similar to the momentum coefficient, typically around 0.9).
Doesn’t this look familiar? It’s similar to Adagrad’s update rule except instead of summing the squared gradients, it takes the exponential moving average of the squared gradients, only lowering the learning rate within a given time window (as previous influences decay).
This prevents the learning rate from shrinking too quickly and allows it to adapt more smoothly over time, reducing the problem of the learning rate being too small in Adagrad.
Here’s a quick gif comparing some optimizers (ignore Adadelta and NAG):
Notice how all algorithms except for SGD make it down, since they have either adaptive learning rates or momentum.
Although RMSprop improves on adaptive learning rates, it lacks the benefits of momentum, which can be extremely helpful in awkwardly shaped parameter spaces.
This led to the development of one of the most famous and widely used optimizers of all time: Adam.
Adaptive Moment Estimation (Adam)
The Adam optimizer is a combination of momentum and RMSprop, maintaining both an exponential average of past gradients (similar to momentum) and an exponential average of past squared gradients (similar to RMSprop) to adapt the learning rate dynamically.
Adam, like RMSprop and Adagrad, can be used with the whole dataset, mini-batches, and stochastic data points.
Adam uses what we call the first and second moments to calculate the parameter update.
A moment (in mathematics) is a measure of the shape, spread, or characteristics of a data distribution.
The first moment typically refers to the mean or average of a distribution.
In Adam, the first moment stores the average direction of gradients.
The second moment usually measures the magnitude or spread of values (it is closely related to the variance).
In Adam, the second moment captures the scale or magnitude of the gradient.
The general formula for calculating the first (mt) and second (vt) moments is shown below:
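These follow the standard Adam formulation:

m_t = β1 · m_{t-1} + (1 - β1) · ∇_W L(W_{t-1})
v_t = β2 · v_{t-1} + (1 - β2) · (∇_W L(W_{t-1}))²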
Here, mt and vt are the first and second moment estimates, β1 and β2 are decay rates, and ∇WL(Wt-1) is the gradient of the loss with respect to a parameter Wt-1.
In these calculations, vt represents the RMSprop portion for adjusting the learning rate, and mt represents the momentum portion, allowing the optimizer to speed up convergence and escape flat areas.
In the Adam optimizer, we need to correct the bias of the moments. Without correction, the initial values of the moments are biased toward zero when there are only a few gradients to average.
This is done by dividing each moment estimate by a factor related to the decay rate.
Here are bias-corrected estimates of these moments:
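In the standard Adam formulation:

m̂_t = m_t / (1 - β1^t)
v̂_t = v_t / (1 - β2^t)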
m̂_t and v̂_t are the bias-corrected first and second moments, while β1^t and β2^t are the decay rates raised to the power of the current timestep t.
Here’s what the final update rule looks like:
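In the standard formulation (the square root and division are element-wise):

W_t = W_{t-1} - α · m̂_t / (√v̂_t + ϵ)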
Above, α is the learning rate, and ϵ is a small constant that prevents division by zero.
Adam provides an effective optimization process with adaptive learning rates, allowing for fast convergence and stable updates in nearly all cases.
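Pulling the pieces together, here is a minimal NumPy sketch of a single Adam step; the function name, toy loss, and usage loop are illustrative, while the hyperparameter defaults simply follow common practice:

```python
import numpy as np

def adam_step(W, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based timestep."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * grad**2         # second moment (RMSprop part)
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive parameter update
    return W, m, v

# Toy usage: minimize L(W) = ||W||^2, whose gradient is 2W.
W = np.array([5.0, -3.0])
m, v = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 10001):
    W, m, v = adam_step(W, m, v, 2 * W, t)
print(W)  # very close to [0, 0]
```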
There are more modern takes on Adam, but we won’t discuss them here. With that, we’ve reached the end of this issue!
Here’s a summary table of all the optimizers discussed in this issue:
| Optimizer | Key idea | Main benefit |
| --- | --- | --- |
| Momentum | Adds a velocity term to standard gradient descent | Reduces oscillations, speeds up convergence |
| Adagrad | Adapts the learning rate for each parameter based on past squared gradients | Works well for sparse data |
| RMSprop | Adjusts the learning rate based on an exponential average of squared gradients | Helps stabilize updates |
| Adam | Combines momentum and RMSprop | Fast convergence, widely effective |
Here’s a special thanks to our biggest supporters:
Sushant Waidande
If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!
Feedback, inquiries, advertising? Send us an email at [email protected].
If you like, you can also donate to our team to push out better newsletters every week!
That’s it for this week’s issue of AI, but simple. See you next week!
—AI, but simple team