Batch Normalization
AI, But Simple Issue #12
Quick note: this issue requires some knowledge of convolutional neural networks and of neural networks in general, so it may be helpful to review those concepts first. More on those topics can be found in our archive: CNN, Neural Networks
Batch Normalization is an essential tool used in modern deep learning workflows.
Soon after it was introduced in the Batch Normalization paper, it completely changed the neural network space, as it helped create deeper neural networks that could be trained faster.
Batch Normalization is a neural network layer that is commonly used in many architectures.
It often gets added as part of a Fully-Connected or Convolutional block and helps stabilize the network during training.
Batch Normalization is often abbreviated as BN or Batch Norm.
Standardization
Let’s start with some background information about normalizing inputs.
When inputting data into a deep learning model, it is standard practice to normalize the data to have a mean of zero and a standard deviation of one.
Strictly speaking, this process is called “standardization”; the “normalization” in Batch Normalization’s name is a slight misnomer, but the two terms are often used interchangeably.
So, for each feature column separately, we take the values of all samples in the dataset and compute the mean μ and the standard deviation σ. Then, we normalize each value x using this formula:

$$\hat{x} = \frac{x - \mu}{\sigma}$$

This normalized value is also called the z-score.
In the picture below, we can see the effect of normalizing data. The original values (in blue) are now centered around zero (in orange). This ensures that all the feature values are now on the same scale.
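To make this concrete, here is a minimal NumPy sketch of per-feature standardization (the data values are made up for illustration):

```python
import numpy as np

# Toy dataset: 5 samples, 2 features on very different scales (made-up values)
X = np.array([[1.0,  200.0],
              [2.0,  400.0],
              [3.0,  600.0],
              [4.0,  800.0],
              [5.0, 1000.0]])

mean = X.mean(axis=0)   # per-feature mean
std = X.std(axis=0)     # per-feature standard deviation

X_standardized = (X - mean) / std    # z-score: zero mean, unit variance per feature

print(X_standardized.mean(axis=0))   # ~[0, 0]
print(X_standardized.std(axis=0))    # ~[1, 1]
```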
But why is standardization important?
To understand what happens without standardization, let’s look at an example with just two features that are on drastically different scales.
The network will then learn weights for each feature that are also on very different scales.
As a result, an update that is appropriate for one weight is far too large (or too small) for the other.
This causes the gradient descent trajectory to oscillate back and forth along one dimension, taking many more steps to reach the minimum.
Because of this uneven trajectory, the network takes longer to converge.
Instead, if the features are on the same scale, the loss landscape is more uniform, like a bowl. Gradient descent can then proceed smoothly down to the minimum.
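Here is a small, made-up experiment that shows the effect: plain gradient descent on a two-feature regression problem, once with the raw features and once after standardizing them. The data and learning rates are illustrative, not from a real benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales (illustrative, made-up data)
x1 = rng.uniform(0, 1, size=200)
x2 = rng.uniform(0, 1000, size=200)
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.002 * x2

def final_loss(X, y, lr, steps=1000):
    """Plain gradient descent on mean squared error; returns the final loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# Raw features: the large-scale feature forces a tiny learning rate,
# so the other weight barely moves and the loss stays high.
print(final_loss(X, y, lr=1e-6))

# Standardized features: a much larger learning rate is stable,
# and the loss drops to essentially zero in the same number of steps.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(final_loss(X_std, y, lr=0.1))
```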
The need for Batch Normalization
The reason behind Batch Normalization starts to become clear:
The inputs of each hidden layer are the activations from the previous layer, and these can also be normalized to improve performance.
If we are able to somehow normalize the activations coming out of each layer, then gradient descent will converge faster during training.
This is precisely what the Batch Norm layer does for us.
Batch Norm is just another neural network layer that gets inserted between two hidden layers.
Its job is to take the outputs from the first hidden layer and normalize them before passing them on as the input of the next hidden layer.
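As a rough sketch of what this looks like in PyTorch (the layer sizes are arbitrary), the Batch Norm layers simply sit between the hidden layers:

```python
import torch.nn as nn

# A small fully-connected network with Batch Norm inserted between hidden layers
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes the 64 activations from the first layer
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
```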
A Batch Norm layer has parameters of its own:
Two learnable parameters called beta and gamma.
Two non-learnable parameters (Mean Moving Average and Variance Moving Average) are saved in the layer.
Each Batch Norm layer has its own copy of the parameters.
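In PyTorch, for example, all four of these quantities live on the layer itself; gamma and beta are exposed as `weight` and `bias`, and the moving averages are stored as non-learnable buffers:

```python
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=64)

# Learnable parameters: one gamma and one beta per feature
print(bn.weight.shape)        # gamma -> torch.Size([64])
print(bn.bias.shape)          # beta  -> torch.Size([64])

# Non-learnable buffers: moving averages of the mean and variance
print(bn.running_mean.shape)  # torch.Size([64])
print(bn.running_var.shape)   # torch.Size([64])
```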
Batch normalization process, visualized
During training, we feed the network one mini-batch of data (of a fixed size, the mini-batch size) at a time.
So when the mini-batch reaches the Batch Norm layer, the activations from the previous layer are passed to it as input.
For each feature (neuron), there is one activation value per sample, giving a vector of values across the mini-batch.
Then, for each of these per-feature vectors separately, we calculate the mean and variance of all the values in the mini-batch.
We can calculate the normalized values for each activation feature vector using the corresponding mean and variance.
These normalized values now have a mean of zero and a standard deviation of one.
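In symbols, for each feature, the mini-batch mean and variance are used to normalize every activation in the batch; as in the original paper, a small constant ε is added inside the square root for numerical stability:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$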
Unlike the input layer, which requires all normalized values to have zero mean and standard deviation of one, Batch Norm allows its values to be shifted (to a different mean) and scaled (to a different variance).
This aspect is the huge innovation introduced by Batch Norm that makes it so versatile.
It does this by multiplying the normalized values by a factor, gamma, and adding to it a factor, beta.
Note that this is an element-wise (Hadamard) multiplication, not a matrix multiplication (dot product).
What makes the Batch Norm layer even better is that it learns the best factors for itself (since gamma and beta are trainable parameters), and can thus shift and scale the normalized values in whatever way gives the best predictions.
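Putting the normalization and the scale-and-shift together, a minimal training-time forward pass might look like the NumPy sketch below (the function name, shapes, and epsilon value are our own choices here, not a fixed standard):

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_features); gamma, beta: (num_features,)"""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: zero mean, unit variance
    y = gamma * x_hat + beta                # element-wise scale and shift
    return y, mu, var

x = np.random.randn(32, 4) * 5 + 10         # a made-up mini-batch
gamma, beta = np.ones(4), np.zeros(4)       # typical initialization
y, mu, var = batchnorm_forward_train(x, gamma, beta)
```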
In addition to learning the scale and shift, Batch Norm also keeps a running estimate, an Exponential Moving Average (EMA), of the mean and variance.
During training, it simply updates this EMA but does not use it for anything.
When training is finished, these values are saved as part of the layer’s state, ready for when the model is making predictions.
During inference (making predictions), we have a single sample, not a mini-batch. How do we obtain the mean and variance in that case?
Here is where the two moving average parameters come in—the ones that we calculated during training and saved with the model.
We use those saved mean and variance values for the Batch Norm during inference.
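Continuing the earlier sketch, the moving averages could be maintained and then used at inference like this (the momentum value of 0.1 is a common default, but it is an assumption here):

```python
import numpy as np

def batchnorm_forward_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At inference, normalize with the saved moving averages, not batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# During training, the moving averages are updated after each mini-batch.
momentum = 0.1                                   # assumed smoothing factor
running_mean, running_var = np.zeros(4), np.ones(4)
gamma, beta = np.ones(4), np.zeros(4)

batch = np.random.randn(32, 4) * 5 + 10          # made-up training mini-batch
mu, var = batch.mean(axis=0), batch.var(axis=0)
running_mean = (1 - momentum) * running_mean + momentum * mu
running_var = (1 - momentum) * running_var + momentum * var

# At inference, a single sample is normalized with the saved statistics
sample = np.random.randn(1, 4) * 5 + 10
out = batchnorm_forward_inference(sample, gamma, beta, running_mean, running_var)
```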
After the forward pass, we do the backward pass as normal.
Gradients are calculated, and updates are done for all layer weights, as well as for all beta and gamma parameters in the Batch Norm layers.
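In PyTorch, for instance, gamma and beta pick up gradients from the backward pass just like any other weights (a toy example with a dummy loss):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)          # mini-batch of 8 samples, 4 features
loss = bn(x).pow(2).mean()     # dummy loss, just to run backward
loss.backward()

print(bn.weight.grad.shape)    # gradient with respect to gamma
print(bn.bias.grad.shape)      # gradient with respect to beta
```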
Placement
So where do we place these layers?
There are two schools of thought on where the Batch Norm layer should be placed in the architecture: before or after the activation function.
The original paper placed it before the activation, although you will find both options frequently used in practice.
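In code, the two placements for a convolutional block might look like this (just one common way to write them):

```python
import torch.nn as nn

# Option 1: Batch Norm before the activation, as in the original paper
conv_bn_relu = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Option 2: Batch Norm after the activation
conv_relu_bn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(16),
)
```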
If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!
If you like, you can also donate to our team to push out better newsletters every week!
That’s it for this week’s issue of AI, but simple. See you next week!
—AI, but simple team