Deep Learning's Top Pitfalls: Why Your Model Struggles (And How to Fix It!)
AI, But Simple Issue #39

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!
This week, we’ll be going over some common pitfalls in deep learning—and, more importantly, how to fix them for better-performing models. From vanishing gradients to overfitting, we’ll break down key challenges that can negatively impact model performance and explore how to solve them.
Before getting into today’s issue, we’re proud to announce that we’re partnering with 1440 Media for this sponsored segment:
Fact-based news without bias awaits. Make 1440 your choice today.
Overwhelmed by biased news? Cut through the clutter and get straight facts with your daily 1440 digest. From politics to sports, join millions who start their day informed.
Vanishing Gradient
Imagine you’re shining a flashlight through a stack of tinted sunglasses, where each pair dims the light a little.
If you stack 10 pairs, the beam becomes extremely dim. Stack 20, 30, or even 50 pairs, and the final beam is so weak that you can’t see anything at all.

Here, the flashlight is the "learning signal" in a neural network, and the sunglasses represent the network’s layers. By the time the signal reaches the early layers (the first pairs of sunglasses), it’s too small to adjust the network’s weights meaningfully (the beam is too dim to see). This is a classic case of the vanishing gradient problem.
The mathematical reason it happens is that many small numbers get multiplied together. For instance, if each layer halves the signal passing through it, then after just 10 layers the signal has been multiplied by 0.5^10 ≈ 0.001, an extremely small value.
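A minimal sketch of this shrinkage, assuming a hypothetical per-layer scaling factor of 0.5 (real per-layer factors depend on the weights and activation derivatives, but the geometric decay is the same):

```python
# Illustration of the vanishing gradient problem: if each layer scales
# the backpropagated signal by a factor less than 1 (here, 0.5), the
# signal shrinks geometrically with depth.
def gradient_magnitude(n_layers, per_layer_factor=0.5):
    """Magnitude of the learning signal after passing back through n layers."""
    return per_layer_factor ** n_layers

for depth in (10, 20, 50):
    print(f"{depth} layers -> signal scaled by {gradient_magnitude(depth):.2e}")
```

At 10 layers the signal is already scaled by about 0.001; at 50 layers it is around 10^-16, effectively zero in floating-point terms, so the earliest layers receive no usable update.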