Efficient Optimization Algorithms: Mathematically Explained
AI, But Simple Issue #25

Hello from the AI, But Simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on other measures to cover operational expenses. Thanks again for reading!
In deep learning, an optimizer (or optimization algorithm) is a crucial element that fine-tunes a neural network’s parameters (weights and biases) during training.
Last week, we went over some simple optimization algorithms that were easy to use and easy to understand. We highly recommend reading that issue first; if you want to check it out, it can be found at the link below:
Before continuing, it is important to have a good understanding of neural networks and the mathematics behind them (covered in the previous issue).
Make sure you are comfortable with basic multivariable calculus and linear algebra, and that you have a solid understanding of gradient descent.
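As a quick refresher, gradient descent repeatedly moves the parameters a small step against the gradient of the loss. Here is a minimal sketch on a toy one-dimensional loss (the function and learning rate are illustrative choices, not from this issue):

```python
# Minimal gradient-descent refresher on the toy loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3) and whose minimum sits at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter guess
lr = 0.1   # learning rate (step size), chosen small enough to converge

for _ in range(100):
    w = w - lr * grad(w)  # step in the direction of steepest descent

print(round(w, 4))  # converges toward the minimum at 3.0
```

Every optimizer discussed in this issue is a refinement of this same update rule.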
Before we dive into today’s issue, we’re proud to announce that we’ve partnered with 1440 Media for this sponsored segment:

Overwhelmed by biased news? Cut through the clutter and get straight facts with your daily 1440 digest. From politics to sports, join millions who start their day informed.
Fact-based news without bias awaits. Make 1440 your choice today.
Coming back to the topic of optimizers: in practice, the optimization methods from last week’s issue do not perform as well as more modern ones, in terms of both efficiency and final results.
Poorly conditioned, irregularly shaped parameter spaces expose the weaknesses of many of these simple gradient-descent-based optimization algorithms.
For instance, Stochastic Gradient Descent (SGD) and Mini-batch Gradient Descent (Mini-batch GD) oscillate heavily, leading to longer convergence times and the risk of getting stuck at saddle points or in other flat regions.
Although updating the parameters per mini-batch or per data point reduces the risk of getting stuck at these points, the injected noise is not fully effective at escaping them, so there is still room for improvement.
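The oscillation problem can be seen even on a toy loss surface. The sketch below (an illustrative example, not from this issue) runs plain gradient descent on an ill-conditioned quadratic: the steep coordinate flips sign almost every step, while the shallow coordinate barely moves.

```python
import numpy as np

# Toy ill-conditioned quadratic loss L(w) = 0.5 * (a*w1^2 + b*w2^2),
# with a >> b: the surface is a narrow ravine. A single learning rate
# must stay below the stability limit 2/a of the steep axis, which
# starves the shallow axis of progress.
a, b = 100.0, 1.0

def grad(w):
    return np.array([a * w[0], b * w[1]])

w = np.array([1.0, 1.0])
lr = 0.019  # just under 2/a = 0.02, so the steep axis oscillates
path = [w.copy()]
for _ in range(20):
    w = w - lr * grad(w)
    path.append(w.copy())

# The steep coordinate w1 flips sign on every step (oscillation),
# while the shallow coordinate w2 shrinks only slowly.
signs = [np.sign(p[0]) for p in path]
print(signs[:4])  # alternating signs along the steep axis
print(path[-1])   # w2 is still far from the minimum at 0
```

Momentum-based and adaptive optimizers, which we cover next, are designed precisely to damp this oscillation and speed up progress along the shallow directions.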