Deep Learning Common Terms and Misconceptions

AI, But Simple Issue #21

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!


This week, we’ll be going over some commonly used terms in deep learning and machine learning, which will help you build a stronger understanding of deep learning processes and methods.

We’ll take the time to review each term individually, provide clear definitions, and explore use cases to see how they are applied.

Before getting into the explanations, we’re proud to announce that we’re partnering with 1440 Media for this sponsored segment:

Seeking impartial news? Meet 1440.

Every day, 3.5 million readers turn to 1440 for their factual news. We sift through 100+ sources to bring you a complete summary of politics, global events, business, and culture, all in a brief 5-minute email. Enjoy an impartial news experience.

Batch and Mini-Batch

Batches and mini-batches can be confusing because the two terms are often used interchangeably to refer to a subset of the entire dataset.

A batch is a subset of the training dataset used to train the neural network in one iteration.

A mini-batch is also a small subset of the training dataset (usually 32, 64, or 128 samples).

A dataset is split up into these “batches” (or mini-batches) after defining a fixed “batch size”: the number of samples in each batch.

  • A smaller batch size introduces some noise into the gradient estimates, which can help with convergence, while a larger batch size produces smoother gradients but makes each update more expensive and gives fewer updates per epoch.

The distinction mainly shows up in how gradient descent techniques are named.

  • Gradient descent is used as an optimizer, an optimization technique that allows neural networks to “train”.

  • Optimizers are algorithms used to update the network’s parameters, such as weights and biases, to reduce the loss. (Some optimizers also adapt the learning rate, which is itself a hyperparameter rather than a parameter.)

When the batch size equals the total number of samples in the training dataset, it is called batch gradient descent; when the batch size is 1, it is called stochastic gradient descent (SGD).

On the other hand, mini-batch gradient descent updates the model through backpropagation every time a mini-batch is processed, and since each mini-batch is smaller than the whole dataset, it makes more updates per epoch than batch gradient descent.

  • Mini-batch gradient descent also introduces some noise into each update, borrowing from the benefits of stochastic gradient descent to escape plateaus and converge better. The training-loop sketch below shows this in practice.
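To make this concrete, here is a minimal PyTorch-style sketch of mini-batch gradient descent (the toy data, model, and batch size are made up for illustration): the DataLoader splits the dataset into mini-batches, and the optimizer performs one parameter update per mini-batch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 100 samples, 4 features (made up for illustration)
X = torch.randn(100, 4)
y = torch.randn(100, 1)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # mini-batches of 32

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:             # one iteration per mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()               # backpropagation
        optimizer.step()              # one parameter update per mini-batch
```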

If you want to learn more about the different types of optimizers, feel free to check out this issue.

Hyperparameters and Parameters

Hyperparameters are the settings or configurations that the model does not learn or update from the data during training.

Hyperparameters are set before training begins and are tuned using hyperparameter tuning methods like grid search or random search.

Some hyperparameters include learning rate, batch size, number of epochs, network architecture, and the choice of activation functions.

  • Certain optimization algorithms like Adam and RMSprop have adaptive learning rates; they scale the effective learning rate for each parameter based on its gradient history.

In contrast, parameters are the values the model itself learns and updates during training; common examples include weights and biases. The sketch below separates the two.
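As a rough illustration (the layer sizes and values here are arbitrary), this is how the two usually show up in a PyTorch-style script: hyperparameters are plain values you pick up front, while parameters live inside the model and are handed to the optimizer.

```python
import torch
from torch import nn

# Hyperparameters: chosen before training, never updated by the optimizer
learning_rate = 1e-3
batch_size = 64
num_epochs = 20
hidden_units = 128            # part of the network architecture

# Parameters: created inside the model and updated during training
model = nn.Sequential(
    nn.Linear(10, hidden_units),
    nn.ReLU(),                # the choice of activation is also a hyperparameter
    nn.Linear(hidden_units, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# model.parameters() yields the weights and biases the optimizer will update
num_params = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {num_params}")
```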

Hyperparameter Tuning

Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a model.

We have many different hyperparameter tuning methods with different efficiencies.

Grid Search is an exhaustive search over a user-defined grid of hyperparameter values. It can take a long time depending on the size of the grid, and it can miss the optimal values entirely if they fall between the grid points.

To make this process more efficient, we can use Random Search, where we randomly sample hyperparameter configurations, often reaching a decent, acceptable value much more quickly.

Another frequently used technique, Bayesian Optimization, builds a probabilistic model of the loss surface to decide which hyperparameters to try next.

This is where Bayesian Optimization and Random Search have an advantage: the fixed subset chosen in Grid Search may never pick the optimal value at all, but if we randomize (or adaptively guide) the sampling, we can luckily end up with near-optimal values using fewer trials.

The underlying loss curve is not known in advance; we are simply guessing configurations and trying to reach a minimum loss.

There are other, more niche and research-heavy techniques, such as Hyperband, but the previously mentioned methods should be sufficient for most personal projects. A short sketch contrasting grid search and random search is shown below.
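As a rough sketch (the search space and the placeholder evaluate function are hypothetical; in practice you would train a model and return its validation loss), here is how the two differ in code: grid search enumerates every combination, while random search samples a fixed budget of configurations.

```python
import itertools
import random

# Hypothetical search space (values chosen for illustration)
space = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.2, 0.5],
}

def evaluate(config):
    # Placeholder: in practice, train a model with this config and
    # return its validation loss
    return random.random()

# Grid search: try every combination in the grid (27 trials here)
grid_results = []
for values in itertools.product(*space.values()):
    config = dict(zip(space.keys(), values))
    grid_results.append((evaluate(config), config))

# Random search: sample a fixed budget of random configurations (10 trials)
random_results = []
for _ in range(10):
    config = {k: random.choice(v) for k, v in space.items()}
    random_results.append((evaluate(config), config))

print("best (grid):  ", min(grid_results, key=lambda r: r[0])[1])
print("best (random):", min(random_results, key=lambda r: r[0])[1])
```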

  • Read more about hyperparameter tuning, which is covered in much more detail in this issue.

Epochs and Iterations

An epoch in neural network training is one complete pass through the entire training dataset.

During an epoch, the model sees every training example once, with parameter updates happening along the way (typically after each batch).

The number of epochs you choose multiplied by the number of batches in your dataset determines how many times the optimizer updates the parameters.

  • For instance, if you are training a network using mini-batch gradient descent, have a dataset with 100 entries, your batch size is 10 entries, and you run it for 10 epochs, the optimizer (mini-batch GD) will update the parameters 100 times (since it updates 10 times per epoch, for 10 epochs).

One update of the model’s parameters, typically after processing one batch, is actually called an iteration, not an epoch.
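Here is that same arithmetic written out as a quick sketch:

```python
# Reproducing the example above: 100 samples, batch size 10, 10 epochs
dataset_size = 100
batch_size = 10
epochs = 10

iterations_per_epoch = dataset_size // batch_size    # 10 batches per epoch
total_iterations = iterations_per_epoch * epochs     # 10 * 10 = 100 updates

print(total_iterations)  # 100
```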

While increasing epochs allows the model to learn more from the data, too many epochs can lead to overfitting, where the model performs well on training data but poorly on unseen data.

Overfitting and Underfitting

Overfitting is when a model learns the training data too well, including its noise, which results in poor performance on new and unseen data.

  • Overfitting typically happens when the model has too many parameters for the actual complexity of the task or when there is not enough data to properly train the model.

Underfitting is when a model is too simple to capture the actual pattern of the data, performing poorly on both the training data and new data.

This can happen by not training the model for enough iterations or having an architecture that is too simple for the task.

Here’s how the loss curves typically behave: when a model is underfitting, both the training and validation losses stay high; when it is overfitting, the training loss keeps dropping while the validation loss starts to climb.

Regularization

Regularization is a set of techniques used to prevent overfitting by adding additional information or constraints to the model.

There are many different types of regularization. Here are some common examples:

  • L1 Regularization: Adds a penalty term (regularization term) to the loss function that is equal to the absolute value of the model’s weights. The model is not just trying to minimize the error between its predictions and the actual data but is also being encouraged to keep the weights small. L1 regularization is useful for feature selection.

  • L2 Regularization: Adds a penalty that is equal to the square of the magnitude of weights. This encourages the model to distribute weights more evenly and prevents any single weight from becoming too large. This helps in reducing overfitting by smoothing the model and making it less sensitive to noise in the training data.

  • Dropout: Randomly sets a fraction of neuron activations to 0 during training, which prevents the network from relying too heavily on any single neuron and helps reduce overfitting.

  • Early Stopping: Stops training when performance on a validation set starts to degrade.

When the cost function with L1 regularization is plotted, the absolute-value penalty forms a diamond-like constraint region, which is why L1 tends to push some weights exactly to zero. The squaring in the L2 term instead makes the region circular and the penalty parabolic, shrinking all weights smoothly. The sketch below shows how these techniques commonly appear in code.
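Below is a minimal PyTorch-style sketch, with made-up toy data, combining the techniques above: weight_decay for an L2-style penalty, a manual L1 penalty added to the loss, a Dropout layer, and a simple early-stopping check on validation loss.

```python
import torch
from torch import nn

# Toy data (made up for illustration): 200 train / 50 validation samples
X_train, y_train = torch.randn(200, 20), torch.randn(200, 1)
X_val, y_val = torch.randn(50, 20), torch.randn(50, 1)

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # dropout: zero out 50% of activations during training
    nn.Linear(64, 1),
)

# L2 regularization: weight_decay adds a squared-magnitude penalty to each update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
loss_fn = nn.MSELoss()

def l1_penalty(model, lam=1e-4):
    # L1 regularization: lambda times the sum of absolute parameter values
    return lam * sum(p.abs().sum() for p in model.parameters())

best_val, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train) + l1_penalty(model)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    # Early stopping: halt training when validation loss stops improving
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```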

Residual Connections/Skip Connections

These are neural network connections that skip one or more layers, letting information flow forward unchanged and gradients flow directly back to earlier layers, which reduces the vanishing gradient problem.

These connections were introduced in Residual Networks (ResNets); the skip path passes the layer’s input through unmodified and adds it to the layer’s output, so each block only has to learn a residual correction on top of the identity.

This enables training of very deep networks without too much loss of information.
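A minimal sketch of the idea, using a hypothetical ResidualBlock module for illustration: the block’s output is its transformation of the input plus the unmodified input itself.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x (hypothetical, for illustration)."""

    def __init__(self, dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.layers(x) + x   # skip connection adds the unmodified input

x = torch.randn(8, 32)
block = ResidualBlock(32)
print(block(x).shape)  # torch.Size([8, 32])
```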

If you want to learn more about ResNets, we have an issue explaining it here.

Attention Mechanisms

Attention mechanisms are used frequently in transformers, a type of network that revolutionized the Natural Language Processing (NLP) space.

They allow models to focus on specific parts of the input when generating an output.

They assign a sort of “importance” weight to each token, letting the model learn the contextual relationships between words in a sentence, which is a big part of why transformers perform so well on NLP tasks. A minimal sketch of the core operation, scaled dot-product attention, is shown below.
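Here is a minimal sketch of scaled dot-product attention, the core operation inside transformer attention (in a real transformer, Q, K, and V come from learned linear projections of the token embeddings, and multiple attention heads run in parallel):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise token similarities
    weights = torch.softmax(scores, dim=-1)            # "importance" of each token
    return weights @ v

# Toy example: a sequence of 5 tokens with 16-dimensional embeddings
x = torch.randn(5, 16)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([5, 16])
```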

If you want to read more about the attention mechanism, we have an issue explaining it along with the transformer architecture here.

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Feedback, inquiries, advertising? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team
