Modern Convolutional Neural Network Architectures

AI, But Simple Issue #14

In previous issues, we’ve explored the many different types of neural networks, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers, among others.

However, we’ve only scratched the surface of what’s possible with these types of networks, and in recent years, they have gotten extremely powerful.

In this issue, we’re going to focus on one of the building blocks of deep learning, an architecture that accelerated the massive growth of neural networks: the Convolutional Neural Network (CNN).

A standard CNN design.

  • If you don’t know anything about these networks, check out our issue on CNNs to learn the basics before continuing (padding, stride, conv window size, etc.)

Although Convolutional Neural Networks have received less research attention in the past five years, they were among the most popular research topics of the 2010s.

As computer vision and image processing became more and more popular during that time, CNNs rose to fame.

CNNs existed before the 2010s, but they weren't considered effective or efficient enough for demanding, real-world computer vision and image processing tasks.

It was only in 2012 that CNNs started to take off and become a crucial tool in many deep learning engineers' toolkits.

One groundbreaking architecture proved that features learned by a neural network could outperform manually selected features in computer vision.

  • This architecture is known as AlexNet, an 8-layer CNN that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a large margin, accelerating the adoption of neural networks for image tasks.

AlexNet

AlexNet showed, for the first time, that the features obtained by learning can overtake manually designed features, pushing past the previous paradigm in computer vision.

  • AlexNet was made possible by the rapid growth in available computing power, as GPUs became more accessible and more powerful.

AlexNet’s architecture shown above.

During the competition, AlexNet was trained on the ImageNet dataset (https://www.image-net.org/), which was a fairly large dataset.

  • The ImageNet dataset contains many different images with various sizes, ranging from 75 x 56 pixels to 4288 x 2848 pixels.

  • However, as a preprocessing step, the images are resized to 256 × 256 pixels.

At the time (in 2012), the ImageNet subset used for the challenge contained roughly 1.2 million training images (plus 50,000 validation and 150,000 test images) spread across 1,000 categories, which is why GPUs and serious computing power were needed.

ImageNet’s many categories.

AlexNet takes its influence from LeNet, a much smaller CNN from 1998 whose size reflected the computing power available at the time.

LeNet compared side-to-side with AlexNet.

But there are many differences between them. Firstly, AlexNet is much deeper than the relatively small LeNet.

AlexNet consists of 8 layers: 5 convolutional layers, 2 fully connected hidden layers, and 1 fully connected output layer.

In AlexNet’s first layer, the convolution window shape is 11×11.

Images in ImageNet are 8 times taller and wider than MNIST images, so objects in ImageNet images take up more pixels, and as such we need a larger convolution window size.

The convolution window shape is reduced to 5×5 in the second layer, and then to 3×3 in the remaining convolutional layers.

In AlexNet’s architecture, after the first, second, and fifth convolutional layers, the network adds max-pooling layers with a window shape of 3×3 and a stride of 2.

After the final convolutional layer, there are two large fully connected layers with 4096 outputs each.

Another particular thing about AlexNet is that it used the ReLU instead of the sigmoid as its activation function. This provided better performance and reduced the vanishing gradient problem caused by saturating activations like the sigmoid.

AlexNet also used dropout in the fully connected layers to handle overfitting.

  • If you want to learn more about regularization techniques like dropout, check out this previous issue.

Another thing to note is that, due to the limited memory of early GPUs, the original AlexNet used a dual data-stream design in which each of its two GPUs stored only half of the model.

Fortunately, GPU memory is relatively abundant now, so we rarely need to break up models across GPUs these days.
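
To make the layer-by-layer description above concrete, here is a minimal single-GPU PyTorch sketch of the AlexNet layer stack. It is an illustration, not an exact reproduction: the channel counts follow the original paper, but the two-GPU split is omitted and a small amount of padding is assumed in the first layer so that a 224 × 224 input works out cleanly.

```python
import torch
from torch import nn

# Minimal single-GPU sketch of the AlexNet layer stack (illustrative only;
# the original model split these channels across two GPUs).
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),  # layer 1: 11x11 window
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # 3x3 max-pooling, stride 2
            nn.Conv2d(96, 256, kernel_size=5, padding=2),           # layer 2: 5x5 window
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),          # layers 3-5: 3x3 windows
            nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),                      # dropout in the fully connected layers
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, num_classes),         # fully connected output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

# A batch of two 224 x 224 RGB images produces 1000 class scores each.
model = AlexNetSketch()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1000])
```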

VGG Network

As CNNs became more heavily researched, their design grew progressively more advanced, with researchers moving from thinking in terms of individual neurons to whole layers, and then to blocks (repeating patterns of layers).

The idea of using blocks first emerged from the Visual Geometry Group (VGG) at Oxford University in their VGG network (the blocks were named VGG blocks).

AlexNet beside VGGNet and the VGG Block.

Before the VGG block, the basic building block of CNNs was made up of:

  • A convolutional layer with padding to maintain the resolution

  • A nonlinearity (activation function) such as a ReLU

  • A pooling layer such as max-pooling to reduce the resolution

One of the problems with this approach is that the spatial resolution decreases quite rapidly.

In particular, this caps how many such layers the network can have before the spatial dimensions are used up: since each pooling layer halves the resolution, the maximum number of downsampling layers is roughly log base 2 of the input dimension. For a 224×224 image, that means at most about 7 halvings (224 → 112 → 56 → 28 → 14 → 7 → 3 → 1).

The key ideas that VGG introduced were to use multiple convolutions before the pooling layer (which downsamples the data) and to group these layers in the form of a block.

The team behind VGG was also interested in whether deep or wide networks perform better, and in a detailed report they showed that deep, narrow networks outperform shallower, wider ones.

  • And as such, CNNs started to have smaller and smaller convolution windows, and 3×3 convolutions became a gold standard in deep networks.

And so, VGG decided to take the idea of a deep and narrow network, and designed the VGG block.

A VGG block consists of a sequence of convolutions with 3×3 kernels with padding of 1 (maintaining the height and width) followed by a 2×2 max-pooling layer with stride of 2 (halving the height and width).

A 3×3 window with a stride of 1 and padding of 1 produces a feature map with the same spatial size as its input.

So instead of going straight from convolution to non-linearity to pooling, the VGG block stacks several convolutions (each followed by a non-linearity) before pooling, as sketched below.
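
As a minimal sketch (assuming PyTorch), a VGG block can be written as a small helper that stacks the 3×3 convolutions and finishes with the 2×2 max-pooling:

```python
from torch import nn

# A VGG block: `num_convs` 3x3 convolutions with padding 1 (height and width
# are preserved), each followed by a ReLU, then a 2x2 max-pooling with
# stride 2 that halves the spatial resolution.
def vgg_block(num_convs: int, in_channels: int, out_channels: int) -> nn.Sequential:
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```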

VGG decided to use these cleverly designed blocks in their networks to improve image classification performance in their VGG Network.

The VGG Network can be partitioned into two parts: the first consisting mostly of convolutional and pooling layers and the second consisting of fully connected layers.

VGG defines a family of networks rather than just a specific one. To build a specific network, we compose the blocks in a specific way.
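
For example, reusing the vgg_block helper sketched above, one way to compose the blocks into a specific member of the family looks like this (the configuration roughly follows VGG-11 and is meant purely as an illustration):

```python
import torch
from torch import nn

# (num_convs, out_channels) for each block; this configuration roughly follows VGG-11.
conv_arch = [(1, 64), (1, 128), (2, 256), (2, 512), (2, 512)]

def make_vgg(conv_arch, num_classes: int = 1000) -> nn.Sequential:
    blocks, in_channels = [], 3
    for num_convs, out_channels in conv_arch:
        blocks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels
    # Five halvings take a 224x224 input down to 7x7 before the fully connected head.
    return nn.Sequential(
        *blocks,
        nn.Flatten(),
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),
    )

net = make_vgg(conv_arch)
print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```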

Residual Networks (ResNet)

As the trend in neural network design shifted toward even deeper networks, many researchers wondered what design choices would drastically improve their performance.

Image of GoogLeNet, a neural network architecture designed a few months prior to ResNet.

One tool that they came up with to help very deep networks train faster is batch normalization, a neural network layer that normalizes intermediate activations so that each mini-batch has roughly zero mean and unit variance.

  • We published an issue explaining batch normalization and its uses, and it can be found here.

The Residual Network (ResNet) combined this batch normalization layer with a few new elements of its own to produce an effective, game-changing model.

This time, the difference between ResNet and the two previously mentioned networks is quite large.

One key breakthrough of ResNet's design is the introduction of the residual block, which has since been used well beyond computer vision, appearing in Recurrent Networks and Transformers as well.

The residual block contains skip connections (residual connections) that allow the input to flow directly to the non-linearity.

  • This means that the inputs can forward propagate faster through the residual connections across layers.

  • If adding more layers does not improve performance, the network can learn to ignore those layers and pass the input directly to the next layer.

A sample residual block to show the residual connection.

The residual connection is implemented by adding the input to the transformed data just before the non-linearity: if F(x) denotes the transformation applied by the block's layers, the block outputs F(x) + x rather than F(x) alone.

  • It keeps the input's influence present in the output, allowing deeper networks to be built without the risk of degrading performance.

More formally, the connection gives the block the option to learn an identity mapping: if its layers learn F(x) ≈ 0, the output is simply the input x, so the output of some layers can be the same as their input.

The full ResNet model combines residual blocks and some additional layers as its design:

A figure of the ResNet-18.

The model shown is a specific model of ResNet, the ResNet-18. ResNet-18 houses 18 layers in total and is a relatively small version.

  • Larger versions of ResNet can be achieved by piecing together more residual blocks in sequences (such as ResNet-152).

A closer look at individual residual blocks.

ResNet shares VGGNet’s extended use of 3×3 convolutional layers.

Each residual block has two 3×3 convolutional layers that create feature maps the same size as the input (padding 1, stride 1).

Each convolutional layer is followed by a batch normalization layer and a ReLU activation function.

Then, a skip connection carries the input past the convolution operations and adds it in just before the final ReLU activation function.
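
Putting those pieces together, here is a minimal sketch of this basic residual block (assuming PyTorch, equal input and output channels, and no downsampling):

```python
import torch
from torch import nn
import torch.nn.functional as F

# Basic residual block: two 3x3 convolutions (padding 1, stride 1), each
# followed by batch normalization; the input is added back to the
# transformed data just before the final ReLU (the skip connection).
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # F(x) + x: the residual connection

# The block preserves the input's shape, so blocks can be stacked freely.
block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

Downsampling variants of the block (stride 2, with a 1×1 convolution on the skip path to match shapes) are what let ResNet-18 and its larger siblings reduce the resolution between stages.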

To wrap this issue up, let’s observe the trend in CNN depth and accuracy, shown below:

The error of various models in the ILSVRC (ImageNet) competitions, plotted against their depth (number of layers).

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team
