Vision Transformers (ViTs)

AI, But Simple Issue #32

Hello from the AI, But Simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

Transformers For Computer Vision

The Transformer neural network architecture was originally developed for machine translation, but it quickly became the top choice for a wide range of natural language processing (NLP) tasks thanks to its strong performance.

As the Transformer revolutionized NLP, researchers began to ask whether its success could extend to other domains, such as computer vision.

In computer vision, the Convolutional Neural Network (CNN) has long been the dominant architecture. Still, researchers wondered whether adapting Transformer models to image data could do even better.

This line of research paved the way for the Vision Transformer (ViT) (Dosovitskiy et al., 2021), a Transformer model "re-engineered" for computer vision tasks.

Essentially, a Vision Transformer (ViT) is an adaptation of the Transformer architecture, with modifications to how the input is processed and how the output is generated.
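
To make this concrete, here is a minimal sketch of the idea in PyTorch. The class name SimpleViT and the hyperparameters (which roughly follow the ViT-Base configuration) are illustrative, not the authors' code: the image is split into fixed-size patches, each patch is linearly embedded, a learnable [CLS] token and positional embeddings are added, and the resulting token sequence is fed through a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal Vision Transformer sketch (illustrative hyperparameters)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Split the image into patches and linearly embed each one.
        # A stride-p convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable [CLS] token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        # Standard Transformer encoder operating on the patch sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, activation="gelu",
            batch_first=True, norm_first=True)  # ViT uses pre-LayerNorm blocks
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Classification head reads the [CLS] token.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # logits from the [CLS] token


logits = SimpleViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

Everything after the patch embedding is an ordinary Transformer encoder; the "re-engineering" is mainly in turning an image into a sequence of patch tokens and in reading the prediction off the [CLS] token.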
