Normalization In Deep Learning

AI, But Simple Issue #29

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!


Training effective deep neural networks (DNNs) is a challenging process with many interconnected elements.

Over the years, researchers have come up with different methods to accelerate and stabilize the learning process. Normalization is one technique that proved to be very effective in doing this.

In machine learning (ML), normalization is used to center and rescale the data to improve the fit of the model. In classical ML pipelines, normalization is usually done once at the start, normalizing the entire dataset before feeding it to the model, a step we refer to as “pre-processing”.

We then use the normalized data to fit the model, resulting in higher overall training efficiency.

However, in deep learning, this process is a little different. In recent years, given the increasing complexity and depth of deep learning models, it is common to normalize the data within the network itself at various points between layers.

In this issue, we will explore three widely used between-layer normalization techniques that speed up training and improve convergence.


Before diving deeper, we recommend having some knowledge of neural networks and normalization; check out some of our previous issues for an introduction.

Batch Normalization

In 2015, researchers Sergey Ioffe and Christian Szegedy aimed to tackle a critical challenge in training deep learning models.

They called this problem the internal covariate shift of a model, which they identified as a major factor slowing down training, especially in deep neural networks.

But what is internal covariate shift? Essentially, due to weight updates, the distribution of inputs to each layer of the network constantly changes, meaning that the following layers need to adapt to the new distribution. This causes slower convergence and unstable training.

There was a need for normalization that happened not just once during pre-processing, but repeatedly, at various points within the model.

This led them to ask the question: why don’t we normalize the activations of each layer in the network?

The answer was batch normalization (batch norm), a normalization layer inserted between model layers to standardize and stabilize training by keeping the distribution of each layer’s inputs steady.

Batch normalization works similarly to the traditional standardization of inputs in the “pre-processing” stage, but adds two learnable parameters, γ and β, which let the model scale and shift the normalized values so each layer can recover whatever input distribution serves it best.
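Concretely, for an input x, batch norm applies the transform below, where μ_B and σ_B² are the mean and variance computed over the mini-batch B, and ε is a small constant added for numerical stability:

y = γ · (x − μ_B) / √(σ_B² + ε) + β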

After thorough experimentation, the authors found that batch normalization offers several benefits:

  • Substantially decreases training time.

  • Reduces (and in some cases removes) the need for dropout.

  • Decreases the amount of regularization needed (acts as a regularization technique).

  • Allows for an increased learning rate.

Where do we place Batch Norm?

There are two schools of thought on where the batch norm layer should be placed in the architecture: before or after the activation.

The original paper placed it before the activation, although you will find both options frequently used in practice.

Here’s how placing batch norm in a network would look:

1. Standard Neural Network (Without BN)

  • Input → Linear Transformation → Activation → Next Layer

Each layer applies a linear transformation (Wx+b) and an activation function (e.g., ReLU).

2. Neural Network With Batch Normalization

With batch norm, the flow changes slightly:

  • Input → Linear Transformation → Batch Normalization → Activation → Next Layer

The output of the linear transformation (Wx+b) is normalized using the batch’s mean and variance. It is then scaled and shifted using γ and β—finally being passed through an activation function.
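In PyTorch, for instance, this ordering looks like the following minimal sketch (the layer sizes 784, 256, and 10 are arbitrary placeholders):

import torch.nn as nn

# A hidden block with batch norm placed before the activation,
# following the original paper's ordering
model = nn.Sequential(
    nn.Linear(784, 256),   # linear transformation (Wx + b)
    nn.BatchNorm1d(256),   # normalize pre-activations across the batch
    nn.ReLU(),             # activation applied to the normalized values
    nn.Linear(256, 10),    # output layer
)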


The Batch Norm Process

To better understand the batch norm process, let’s recall that, accounting for batch size, input data is usually 2, 3, or 4 dimensional:

  • For simple tabular data, the dimensions are (N x W), where N is the batch size and W is the number of features.

  • For grayscale images, the dimensions are (N x H x W); each image has a width W and a height H.

  • For RGB images, the dimensions are (N x H x W x C), since the image has 3 channels (C) for red, green, and blue.
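In NumPy terms, for example, these shapes look like the following (all sizes here are arbitrary placeholders):

import numpy as np

tabular = np.zeros((32, 10))          # (N x W): 32 samples, 10 features
grayscale = np.zeros((32, 28, 28))    # (N x H x W): 32 images of 28 x 28 pixels
rgb = np.zeros((32, 28, 28, 3))       # (N x H x W x C): 3 color channels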

A key difference between tabular data and RGB images is in the treatment of what we call “features.”

In tabular data, the features represent attributes or traits like height, age, or color. In RGB images, the channels become the features—the spatial dimensions (width and height) are not treated as features since they describe the structure of the image.

Here’s an intuitive way to picture the batch norm process:

Imagine each image sample as a book.

Each book (with dimensions H x W x C) has multiple pages (C), with each page having height and width H x W.

Now, think of a row of N books on a bookshelf as a batch of images (N x H x W x C).

To apply batch norm, we take the first page from every book on the shelf and normalize all of those pages together. We then repeat this process for the remaining C - 1 pages.

This normalizes data across the batch but for each feature or channel (C) separately.

  • Specifically, for each feature, batch normalization computes the mean and variance of that feature in the mini-batch.

  • It then subtracts the mean and divides the feature by its mini-batch standard deviation.

This standardizes the data, leaving the data with unit variance and zero mean.
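Here is a minimal NumPy sketch of this computation, assuming a batch of RGB images with shape (N x H x W x C); the sizes and ε value are illustrative:

import numpy as np

x = np.random.randn(8, 16, 16, 3)              # toy batch: (N x H x W x C)

# Per-channel statistics, computed across the batch and spatial axes
mean = x.mean(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, 3)
var = x.var(axis=(0, 1, 2), keepdims=True)

gamma = np.ones((1, 1, 1, 3))                  # learnable scale, one per channel
beta = np.zeros((1, 1, 1, 3))                  # learnable shift, one per channel

x_hat = (x - mean) / np.sqrt(var + 1e-5)       # zero mean, unit variance per channel
y = gamma * x_hat + beta                       # scaled and shifted output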

Drawbacks

Although batch norm is extremely effective, it certainly has its weaknesses.

A problem with batch norm is that its performance decreases dramatically when the batch size (N) is too small, since the mean and variance computed from only a few samples are noisy estimates of the true statistics.

It can also add computational overhead, since batch statistics must be recomputed at every normalization layer on each forward pass. This makes it a poor fit for tasks with a small batch size or a variable batch size (as in online learning).

The significance of this problem pushed the community to create alternative methods to avoid dependency on the batch.

Layer Normalization

In 2016, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton introduced layer normalization (layer norm) as a solution to the batch size constraints and small-sample issues of batch norm.

While batch norm worked well for feedforward deep neural networks, where it is easy to compute the statistics needed for each batch, recurrent neural networks (RNNs) presented a larger challenge.

In RNNs, the input and output sequences can vary in length, making it difficult to apply batch norm to them. The hidden states of RNNs are also influenced by prior states, so using a batch-wide normalization does not account for the sequential nature of the data.

  • In fact, batch norm normalizes across both time steps and the batch dimension, which interferes with the dependencies between time steps.

So, in this case, it’s better to normalize using statistics of a single timestep rather than the whole batch.

Using layer norm offers similar benefits to batch norm, addressing the internal covariate shift issue and leading to more efficient training.

It is no longer dependent on the batch, but it introduces a new potential limitation by normalizing across the channels rather than across the batch, which may not always be beneficial depending on the architecture and task at hand.

The Layer Norm Process

Continuing our analogy of an image sample as a book, each book (H x W x C) has multiple pages (C), with each page having height and width H x W.

Here, a book can represent a single image sample with dimensions (H x W x C), or, in the context of RNNs, a single time step and its features.

Thinking in terms of images, a row of N books on a bookshelf is a batch of images (N x H x W x C).

To apply layer norm, we pick up the first book (H x W x C) and normalize it. We then repeat this process for the rest of the N - 1 books.

Layer norm normalizes each individual book independently, across the features (C) instead of on the batch (N). This way, layer norm is not dependent on the batch.

  • Layer normalization computes the mean and variance of a sample across all features.

  • It then subtracts the mean and divides the sample by its standard deviation, giving each sample unit variance and a mean of zero.

Like batch norm, it speeds up and stabilizes training but without being constrained to the batch.

This method can be used in online learning tasks where the batch is equal to 1.
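As a rough NumPy sketch (same toy shapes as the batch norm example; the learnable scale and shift are omitted for brevity), the only change is which axes the statistics are computed over:

import numpy as np

x = np.random.randn(8, 16, 16, 3)              # toy batch: (N x H x W x C)

# Per-sample statistics, computed across all of that sample's values
mean = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (8, 1, 1, 1): one mean per sample
var = x.var(axis=(1, 2, 3), keepdims=True)

x_hat = (x - mean) / np.sqrt(var + 1e-5)       # each sample: zero mean, unit variance

Since no statistic is shared across the batch axis, the same code works unchanged even when N = 1.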

Instance Normalization

Instance normalization was introduced by Dmitry Ulyanov et al. in their 2016 paper as another attempt to remove dependency on the batch, originally aimed at improving the results of style transfer networks.

  • A style transfer network is a type of model designed to take two input images (a content image and a style image) and combine them so that the output preserves the content of the first while taking on the style of the second.

The Instance Norm Process

Again, think of the structure of an image as a book (H x W x C) with multiple pages (C), with each page having height and width H x W.

Instance norm normalizes one page of size H x W at a time, independently of both the channel (C) and batch (N) dimensions.

To apply instance norm, we normalize the first page of the first book (H x W) on its own. We then repeat this process for every remaining page in every book (N x C pages in total).

By normalizing each channel of each sample independently, the model removes instance-specific details, like varying contrast or color shifts, which do not contribute to the core features of an image.
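As a rough NumPy sketch (same toy shapes as before; the learnable scale and shift are again omitted), instance norm simply computes statistics over the spatial axes only:

import numpy as np

x = np.random.randn(8, 16, 16, 3)              # toy batch: (N x H x W x C)

# Per-sample, per-channel statistics, computed across the spatial axes only
mean = x.mean(axis=(1, 2), keepdims=True)      # shape (8, 1, 1, 3)
var = x.var(axis=(1, 2), keepdims=True)

x_hat = (x - mean) / np.sqrt(var + 1e-5)       # each page: zero mean, unit variance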

This method gained popularity among generative models like Pix2Pix and CycleGAN and became a precursor to the adaptive instance normalization (AdaIN) used in the famous StyleGAN.

Here’s a special thanks to our biggest supporters:

Sushant Waidande

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Feedback, inquiries, advertising? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team
