Convolutional Neural Networks, Explained Mathematically
AI, But Simple Issue #20
Convolutional Neural Networks (CNNs) are a special type of neural network designed to process grid-like data, such as images.
They have revolutionized the computer vision space with their high performance and efficient architecture.
They are particularly effective for image classification tasks due to their ability to capture patterns through convolutional layers.
We’ve gone through the concept, terms, and basics of CNNs in previous issues, so in this one, we’ll focus more on the mathematical process (explained simply).
If you want to read those issues (which are highly recommended), you can find them here:
Convolutional Neural Networks (Beginner)
Learn Stride, Padding, Window Size
Modern Convolutional Neural Network Architectures (Intermediate)
Learn CNN blocks, VGGNet, ResNet
To explain how the math behind a CNN works, we’ll go through a worked example using the CIFAR-10 dataset for image classification.
CIFAR-10 consists of 60,000 color images, each of size 32×32 pixels with 3 color channels (RGB), categorized into 10 distinct classes such as airplanes, cars, birds, cats, etc.
The CNN architecture we will cover today is chosen somewhat arbitrarily, but with purpose: we'll organize it into blocks (convolution, activation, pooling), which is an effective design pattern.
The CNN architecture we’re using has the following layers:
Input Layer: 32×32×3 RGB image.
Block 1
Convolutional Layer 1 (Conv1)
Activation Function: ReLU.
Pooling Layer 1 (Pool1): Max pooling.
Block 2
Convolutional Layer 2 (Conv2)
Activation Function: ReLU.
Pooling Layer 2 (Pool2): Max pooling.
Flatten Layer: Converts the 3D tensor to a 1D vector.
Fully Connected Layer (FC): 512 neurons.
Activation Function: ReLU.
Output Layer: 10 neurons (one for each class) with softmax activation.
If you remember from a previous issue, the standard CNN block contains convolutional layers with a stride of 1 and a padding of 1 (which maintains the resolution), followed by a nonlinearity (activation function, usually ReLU), then a pooling layer to reduce the resolution.
In our CNN, we use the same type of block to better fit the data.
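To make the architecture concrete, here is a minimal sketch of it in PyTorch. The framework and the exact layer objects are our own choice for illustration; the hyperparameters (3×3 kernels, stride 1, padding 1, 32 and 64 filters, 512 fully connected neurons) are the ones used throughout this walkthrough.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CNN described above (our own PyTorch rendering,
# not a prescribed implementation).
model = nn.Sequential(
    # Block 1: conv (32 filters, 3x3, stride 1, padding 1) -> ReLU -> 2x2 max pool
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    # Block 2: conv (64 filters, 3x3, stride 1, padding 1) -> ReLU -> 2x2 max pool
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    # Flatten 8x8x64 -> 4096, then the fully connected layers
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 512),
    nn.ReLU(),
    nn.Linear(512, 10),  # softmax is applied afterwards (or inside the loss)
)

x = torch.randn(1, 3, 32, 32)  # one CIFAR-10-sized RGB image (batch, channels, H, W)
print(model(x).shape)          # torch.Size([1, 10])
```

Passing a random 32×32×3 input through this sketch already shows the final shape we're aiming for: 10 class scores per image.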
Let’s get started with the forward propagation.
The input is composed of a tensor of size 32×32×3 (height, width, and depth). We’ll call this tensor X.
Block 1
Convolutional Layer 1 (Conv1)
Activation Function: ReLU.
Pooling Layer 1 (Pool1): Max pooling.
To propagate this input forward, we transform the input through the first convolutional layer.
We choose to use 32 filters (or kernels) for the convolutional layer, which should be enough for the model to learn abstract representations of the image.
Since we have 32 filters, we get 32 feature maps (the matrix outputs of the convolution process), which we stack to form an output tensor of size 32×32×32.
We use a convolutional window size of 3×3, which VGGNet made massively popular and which has become a gold standard in deep networks.
The window size is 3×3, but it runs across the entire depth of 3 as well, so each filter has an actual size of 3×3×3.
We use a stride of 1 and a padding of 1 to maintain the resolution of the image.
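As a quick check, the output height and width of a convolution follow the standard size formula, where I is the input size, K the kernel size, P the padding, and S the stride:

$$O = \frac{I - K + 2P}{S} + 1 = \frac{32 - 3 + 2(1)}{1} + 1 = 32$$

So the spatial resolution stays at 32×32.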
The calculations for the convolution operation can be represented with this image (although the window size and input size are not the same):
The convolution operation performs an element-wise multiplication between the kernel (the convolution window that slides) and the corresponding input entries, then sums the products to form one entry of the output (as shown above).
In 3D, it works the same way: the kernel is three-dimensional (like a rectangular prism), and it slides across the input, multiplying each of its elements with the corresponding elements of the input and summing all of the products.
The convolution window or kernel itself contains values (parameters) that can be tweaked during the training process, and we usually initialize it randomly.
So for our specific example with the 32×32×3 input, the convolution process will go like this:
There’s a padding of one square around the entire input tensor, and the kernel moves with stride 1, skipping no squares.
Even though the image shows a 2D process, the filter will run through the entire depth of 3 at each position.
To better visualize the 3D aspect of a convolution, please see the image below:
This sliding motion down the entire height and width, across the entire depth, produces just one feature map.
To get all 32 feature maps at the output, we repeat this sliding process for each filter, meaning the convolution window completes 32 full slides across the image in total.
With these parameters, we stack the feature maps and end up with an output tensor with a size of 32×32×32.
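To make the multiply-and-sum concrete, here is a naive NumPy sketch of what a single filter computes. The random input and kernel are placeholders, and real libraries use far faster implementations; this only shows the mechanics at each position.

```python
import numpy as np

# Naive sketch of one 3x3x3 filter sliding over a padded 32x32x3 input (stride 1).
x = np.random.rand(32, 32, 3)     # input image (height, width, channels)
kernel = np.random.rand(3, 3, 3)  # one randomly initialized filter

x_pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # padding of 1 on height and width only

feature_map = np.zeros((32, 32))
for i in range(32):
    for j in range(32):
        window = x_pad[i:i + 3, j:j + 3, :]           # 3x3x3 region of the input
        feature_map[i, j] = np.sum(window * kernel)   # multiply element-wise, then sum

# Repeating this for 32 different filters and stacking the 32 feature maps
# gives the 32x32x32 output tensor described above.
```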
We pass this transformed tensor from the convolutional layer to a ReLU activation function to add some nonlinearity.
For every feature map created by the convolutional layer, we’ll apply the ReLU function to each entry in each row and column.
So we’ll end up applying the ReLU for each individual value in the output tensor.
In equation form, the tensor after passing through this activation will be like this:
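$$A^{(1)}_k(i, j) = \mathrm{ReLU}\!\left(H^{(1)}_k(i, j)\right) = \max\!\left(0,\; H^{(1)}_k(i, j)\right)$$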
It might look complicated, but it is just saying that for the kth feature map, the activation matrix for that feature map (with the superscript (1) indexing the block) is made up of each entry with the ReLU applied.
The H(1) denotes a pre-activation feature map, and the A(1) denotes a post-activation feature map.
The (i, j) notation is easiest to understand from a programming perspective: it indexes every (i, j) location (one cell of the matrix), where i ranges over the height and j ranges over the width.
We’ll then add a pooling layer. In this case, we’ll use max pooling with a 2×2 window and a stride of 2, essentially halving the resolution of the tensor.
The pooling layer won’t affect the depth of our tensor, so we’ll have to pool through all 32 feature maps.
Based on this, we will end up with a tensor of size 16×16×32.
The actual pooling process starts with sliding the pooling window over the input tensor and picking the maximum value of the values in that window (2×2).
As an example, given a 4×4 matrix, the pooling process with a 2×2 window and a stride of 2 will look like this:
This is a 2D representation, but for our case, keep in mind that the pooling window keeps its size of 2×2×1 and repeats this process separately for all 32 feature maps.
The key computational difference between the convolution kernel and the pooling window is that the kernel spans the entire depth of the input, while pooling is applied separately to each individual feature map.
Since it has a stride of 2, the pooling window never overlaps with values it has already covered; it simply moves to the next 2×2 segment that hasn't been pooled yet.
The pooling operation has a mathematical representation like this:
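$$P^{(1)}_k(i, j) = \max_{m,\, n \,\in\, \{0,\, 1\}} A^{(1)}_k(2i + m,\; 2j + n)$$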
Each element of the max pooled output is calculated as follows: for each feature map that has gone through the ReLU (denoted by A(1)), we take the maximum of each 2×2 region and set it as the corresponding value in P (our output after pooling).
We end up with a tensor of all the maximums of the input with a size of 16×16×32 (half the height and width of what we started with).
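In NumPy, this pooling step could be sketched like this (again a naive loop, just to show the mechanics on a single feature map):

```python
import numpy as np

# Sketch of 2x2 max pooling with stride 2 on one 32x32 feature map.
# The same operation is repeated independently for all 32 feature maps.
a = np.random.rand(32, 32)  # one post-ReLU feature map

pooled = np.zeros((16, 16))
for i in range(16):
    for j in range(16):
        window = a[2 * i:2 * i + 2, 2 * j:2 * j + 2]  # non-overlapping 2x2 region
        pooled[i, j] = window.max()                   # keep only the maximum

print(pooled.shape)  # (16, 16)
```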
We’ve now finished our first segment or block of this process! We have one more of these blocks until we flatten the data, pass it to an ANN, and make our prediction.
Block 2
The process will be about the same for the second block of the network.
Convolutional Layer 2 (Conv2)
Activation Function: ReLU.
Pooling Layer 2 (Pool2): Max pooling.
We’ll use another convolutional layer with 64 filters this time to further increase the model’s understanding of the input.
We’ll keep the stride of 1 and padding of 1 to maintain the resolution. This means that the output will have a height of 16 and a width of 16.
We continue to use a 3×3 convolution window, and since the input tensor now has a depth of 32, the window (or kernel) will have dimensions of 3×3×32.
The sliding window during the convolution process will compute the values the same way as mentioned earlier, multiplying with corresponding values, then summing.
The kernel will perform the sliding process 64 times across the input tensor to produce an output tensor of size 16×16×64.
We then apply the ReLU activation function the same way that was mentioned previously; we apply it to every individual entry in the tensor.
Moving on to the second pooling layer, we continue to use a pooling window of 2×2 and a stride of 2 to halve the height and width.
The pooling process is the same as before. For each of the 64 feature maps, we slide the 2×2 pooling window across the 16×16 matrix and take the maximum of the window at each step.
The output tensor from this pooling layer will be 8×8×64 since the pooling layer doesn’t affect the depth.
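If you want to verify these shapes yourself, a quick PyTorch check of the two blocks looks like this (channels-first ordering is PyTorch's convention, so 8×8×64 appears as (64, 8, 8)):

```python
import torch
import torch.nn as nn

# Quick shape trace through the two blocks described above.
x = torch.randn(1, 3, 32, 32)

block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2))
block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2))

x = block1(x)
print(x.shape)  # torch.Size([1, 32, 16, 16])
x = block2(x)
print(x.shape)  # torch.Size([1, 64, 8, 8])
```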
Flatten Layer
After this second block, we’ll send our data to be flattened using the flatten layer. This ensures that our data is the correct size to feed into our MLP later.
The flatten layer will take our 3D tensor as an input and output a flattened 1D version. It will convert a tensor into a vector.
The size of the vector is simply the number of elements in the tensor, which computes to be 4096 (8×8×64). It is calculated like so:
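$$N = K \times H \times W = 64 \times 8 \times 8 = 4096$$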
Where N represents the size of the vector, K is the number of feature maps (depth), H is the height, and W is the width.
We will then pass this vector into a fully connected layer with 512 neurons to learn the data and make predictions.
Fully Connected Layer
The flatten layer serves as the input layer to the fully connected layer (so think of 4096 input neurons), and there will be one weight matrix from the flatten layer to the FC layer and one bias for the FC layer.
For more intuition about weight matrices, layers, and the overall MLP, please check out our issue here.
The weight matrix has a size of 512×4096, and the bias has a size of 512×1.
The mathematical representation of this layer is like so:
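$$z = W P_{\text{flat}} + b$$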
Where z represents the output matrix, W represents the weight matrix, Pflat represents the input vector, and b represents the bias vector.
After passing through this layer, we should end up with a matrix with a size of 512×1 (or just a vector of size 512).
This is because the weight matrix has a size of 512×4096 and the input vector has a size of 4096×1, so the product has a size of 512×1.
We then apply the ReLU activation function one last time to add some nonlinearity.
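As a small NumPy sketch of this step (the random weights stand in for learned parameters):

```python
import numpy as np

# Fully connected layer: a 512x4096 weight matrix times the flattened
# 4096-dimensional input, plus a 512-dimensional bias, followed by ReLU.
p_flat = np.random.rand(4096)           # flattened output of the second block
W = np.random.rand(512, 4096) * 0.01    # weights (randomly initialized here)
b = np.zeros(512)                       # bias

z = W @ p_flat + b        # shape (512,)
a = np.maximum(0, z)      # ReLU applied element-wise
print(a.shape)            # (512,)
```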
Output Layer and Softmax
After passing the vector through the activation function, we will pass it to the output layer.
The weight matrix from the fully connected layer to the output layer will have a size of 10×512, and the bias will have a size of 10, since there are 10 distinct classes of images.
Using the same mathematical transformation as the fully connected layer, we end up with a vector of size 10 after the matrix multiplication and bias addition:
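$$z = W a + b$$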
Here, z represents the output vector, W are the weights, a represents the vector after being passed through the activation, and b is the bias.
We pass the output vector (z) through a softmax activation function to obtain a probability distribution of the input image belonging to a certain class, which is our final output.
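The softmax itself converts the 10 raw scores into probabilities that sum to 1:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}$$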
We’ve just covered one forward propagation of an image in a CNN. We’ll repeat this process for the number of images in our batch, then backpropagate the losses to train the network.
That’s it for this week’s issue of AI, but simple. See you next week!
—AI, but simple team