Efficient Optimization Algorithms: Mathematically Explained
AI, But Simple Issue #25
Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!
In deep learning, an optimizer (or optimization algorithm) is a crucial element that fine-tunes a neural network’s parameters (weights and biases) during training.
Last week, we went over some simple optimization algorithms that are easy to use and easy to understand. We highly recommend reading that issue first; if you want to check it out, it can be found at the link below:
Before continuing, it is important to have a good understanding of neural networks and the mathematics behind them (which can be found in the previous issue).
Make sure you understand basic multivariable calculus and linear algebra, and have a solid grasp of gradient descent.
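As a quick refresher, here is the vanilla gradient descent update in standard notation (the symbols are our choices here: $\theta$ for the parameters, $L$ for the loss, and $\eta$ for the learning rate):

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

The more efficient optimizers covered in this issue are all variations built on top of this basic rule.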
Coming back to the topic of optimizers: in practice, the optimization methods covered in last week’s issue do not perform as well as more modern ones in terms of efficiency and effectiveness.
Problems such as weirdly shaped (poorly conditioned) loss landscapes become very apparent in many of the simple gradient descent based optimization algorithms.
For instance, in Stochastic Gradient Descent (SGD) or Mini-batch Gradient Descent (Mini-batch GD), the optimizer will oscillate a lot, leading to a longer convergence time and the risk of getting stuck in saddle points or other flat areas.
Although updating the parameters in mini-batches or at every data point reduces the risk of getting stuck at these points, there is still room for improvement, as these methods are not fully effective at preventing it.
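To make the baseline concrete, below is a minimal mini-batch SGD sketch in Python/NumPy; the toy linear-regression setup, variable names, and hyperparameter values are illustrative choices rather than code from this issue. The noisy mini-batch gradient in the inner loop is exactly what produces the oscillations described above, and the more modern optimizers we turn to next work by replacing the plain update step with smarter rules.

```python
# Minimal mini-batch SGD sketch (illustrative toy example, not code from the issue).
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = X @ w_true + noise
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)      # parameters to learn
lr = 0.1             # learning rate
batch_size = 32

for epoch in range(20):
    perm = rng.permutation(len(X))  # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        # Gradient of the mean-squared-error loss on this mini-batch only --
        # a noisy estimate of the full gradient, which is what causes the
        # oscillations described above.
        grad = (2.0 / len(idx)) * X_b.T @ (X_b @ w - y_b)
        w -= lr * grad              # plain SGD update step

print("learned w:", np.round(w, 3))
print("true    w:", np.round(w_true, 3))
```

Every optimizer we look at from here on keeps this same training loop and only changes how `grad` is turned into a parameter update.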