Efficient Optimization Algorithms: Mathematically Explained

AI, But Simple Issue #25

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

In deep learning, an optimizer (or optimization algorithm) is a crucial element that fine-tunes a neural network’s parameters (weights and biases) during training.
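
To make that concrete, here is a minimal sketch (our own illustration with made-up names, not code from the issue) of the job an optimizer performs: take the gradients of the loss with respect to each parameter and use them to update the weights and biases.

```python
import numpy as np

# Minimal sketch of an optimizer's role (illustrative names, not from the issue):
# given the gradient of the loss with respect to each parameter, produce updated
# parameter values that (hopefully) reduce the loss.

def vanilla_gd_step(params, grads, learning_rate=0.01):
    """One plain gradient-descent update: theta <- theta - lr * d(loss)/d(theta)."""
    return {name: value - learning_rate * grads[name] for name, value in params.items()}

# Toy usage with a single weight and bias
params = {"w": np.array([0.5]), "b": np.array([0.1])}
grads = {"w": np.array([0.2]), "b": np.array([-0.05])}
params = vanilla_gd_step(params, grads)
print(params)  # {'w': array([0.498]), 'b': array([0.1005])}
```

The more efficient optimizers discussed in this issue differ mainly in how they turn those gradients into an update.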

Last week, we went over some simple optimization algorithms that are easy to use and easy to understand. We highly recommend reading that issue first; if you want to check it out, it can be found at the link below:

Before continuing, it is important to have a good understanding of neural networks and the mathematics behind them (which can be found in the previous issue).

Make sure you understand basic multivariable calculus and linear algebra, and have a solid understanding of gradient descent.

Coming back to the topic of optimizers: in practice, the optimization methods covered in last week’s issue do not work as well as more modern ones in terms of efficiency and efficacy.

Issues such as poorly conditioned (oddly shaped) parameter spaces become very apparent in many of the simpler gradient-descent-based optimization algorithms.

  • For instance, with Stochastic Gradient Descent (SGD) or Mini-batch Gradient Descent (Mini-batch GD), the optimizer oscillates a lot, leading to longer convergence times and a risk of getting stuck at saddle points or in other flat regions.

Although updating the parameters in mini-batches or at every data point reduces the risk of getting stuck at these points, we can still make improvements, as these methods are not fully effective at prevention (a toy numerical example of the oscillation follows below).
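
As a purely illustrative sketch (the loss function, learning rate, and variable names below are our own choices, not taken from the issue), here is plain gradient descent on an ill-conditioned quadratic bowl, which reproduces the zig-zag oscillation described above:

```python
import numpy as np

# Toy illustration of the oscillation problem (our own example, not from the issue):
# plain gradient descent on the ill-conditioned loss f(x, y) = 0.5 * (x^2 + 25 * y^2).
# The steep y-direction forces a small learning rate (the update diverges above
# lr = 2/25 = 0.08), so progress along the shallow x-direction is slow, while at
# lr = 0.07 the y-coordinate overshoots the valley floor and flips sign every step.

def grad(theta):
    x, y = theta
    return np.array([x, 25.0 * y])  # gradient of 0.5 * (x^2 + 25 * y^2)

theta = np.array([1.0, 1.0])  # start away from the minimum at (0, 0)
lr = 0.07
for step in range(15):
    theta = theta - lr * grad(theta)
    print(f"step {step:2d}: x = {theta[0]:+.4f}, y = {theta[1]:+.4f}")
```

Running this, the y-coordinate flips sign at every step while x shrinks by only 7% per step; this zig-zagging and slow progress is exactly the behavior that the more modern optimizers are designed to damp.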

Subscribe to keep reading

This content is free, but you must be subscribed to AI, But Simple to continue reading.
