Transformers and the Attention Mechanism

AI, But Simple Issue #19

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!


The earliest years of the deep learning boom were driven primarily by the Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN) architectures.

Nowadays, a different type of model is being used widely and has become very popular, with examples like ChatGPT, BERT, and Gemini. This model has completely taken the deep learning world by storm, and it is called a transformer.

A transformer model is a type of neural network architecture that has become the foundation for many modern natural language processing (NLP) tasks, such as translation, text generation, and chatbots.

The core idea behind the transformer model is the attention mechanism, an innovation that was originally created for encoder–decoder RNNs for machine translation.

Unlike earlier models like RNNs or LSTMs (Long Short-Term Memory networks), transformers do not process data sequentially.

Instead, they rely on an attention mechanism to process all words (or tokens) in a sentence at the same time, which greatly increases performance on NLP tasks.

Here’s a quick timeline on the attention mechanism and transformers:

  • The attention mechanism was first used in 2014 in computer vision.

  • In 2015, the attention mechanism was applied to machine translation, a standard NLP task.

  • In 2017, the attention mechanism was used in the first transformer network in the famous paper “Attention Is All You Need.”

  • Transformers have since surpassed the performance of RNNs greatly for NLP tasks.

So how did transformers end up dethroning the RNN for NLP tasks? Well, they solved a load of problems the RNN couldn’t handle.

Problem: RNNs struggle with long-range dependencies and do not work well with long text documents.

  • Solution: Transformer networks use many attention blocks, so long-range dependencies are just as likely to be taken into account as short-range ones. Instead of passing along only the final hidden state, as sequence-to-sequence RNNs do, we feed transformers every token, letting them build a fuller picture of the data.

Problem: RNNs suffer from the vanishing and exploding gradient problem.

  • Solution: Transformers rarely suffer from vanishing or exploding gradients. They use skip (residual) connections that let later layers access the input directly (read this issue for more on residual connections).

Problem: RNNs need more training steps to reach a local or global minimum. An RNN can be visualized as an unrolled network that is very deep, since its depth depends on the length of the sequence. Overall, the model requires a longer training time.

  • Solution: Transformers require fewer steps to train than an RNN.

Problem: RNNs do not allow parallel computation. They work as sequence models so all the computation in the network occurs sequentially and cannot be parallelized.

  • Solution: Transformers have no recurrence, which allows for parallel computation.

Before proceeding, it is highly recommended that you have an understanding of RNNs, ANNs, and the intuition behind them. We have previous issues explaining them.

Now that you have some understanding, let’s get into the inner workings of a transformer.

Transformers make use of vector embedding spaces to efficiently handle data.

Embedding Space

The embedding space is a high-dimensional space where each word or token in a vocabulary is represented by a vector of numbers.

These vectors are known as word embeddings.

The idea is to capture the semantic meaning (meaning in language) of words in such a way that words with similar meanings or uses are close to each other in this space.

The word “king” will be more related to the words “queen” and “crown” than to the word “cat.”
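To make this concrete, here’s a minimal sketch in Python (using NumPy) with hand-picked toy vectors rather than real learned embeddings; cosine similarity stands in for “closeness” in the embedding space.

```python
import numpy as np

# Toy 3-dimensional vectors chosen by hand for illustration only;
# real embeddings are learned and have hundreds of dimensions.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "crown": np.array([0.7, 0.9, 0.15]),
    "cat":   np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1 when they point the same way.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (~1.0)
print(cosine_similarity(embeddings["king"], embeddings["cat"]))    # much lower
```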

Consider the sentence: “Bailey is a dog and he is six years old.”

If we just consider the word “he” in the sentence, we see that “and” and “is” are the two words in close proximity to it.

But these words do not give the word “he” any context. The words “Bailey” and “dog” are much more related to “he” in the sentence.

From this example, we understand that proximity is not always relevant, but context is.

Attention Mechanism

The attention mechanism is a fundamental part of transformer models that allows them to weigh the importance of different tokens (words or symbols) in an input sequence when making predictions.

It assigns varying degrees of importance to different tokens to help the model focus on the most relevant portions of the data, like how certain words have more meaning in a sentence.

The mechanism itself computes importance values between each pair of words; every word looks at every other word to decide which ones are most important to understanding the context.

Query, Key, Value

We can improve the current mechanism if we add some trainable parameters. Thus, the idea of query, key, and value was introduced.

The use of query, key, and value is best learned through an example.

Take the sentence “The cat is in the hat”, for instance.

First, the model converts each word in the sentence into a numerical vector called a word embedding vector (X), which does not yet carry any context.

  • These embeddings contain the meaning of each word or token.

So we have our initial embeddings (X_the, X_cat, and so on).

For each token, the model creates query, key, and value vectors by multiplying the token's embedding (a numerical vector) with weight matrices.

  • Q = X × W_Q

    • The query vector represents the word looking for relevant information from other words.

  • K = X × W_K

    • The key vector represents what information each word can offer to others.

  • V = X × W_V

    • The value holds the actual content or meaning that will be passed along once relevance is determined.

These linear transformations are actually used in multi-head attention, which we will discuss later.
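Here’s a minimal sketch of these three projections in Python (with NumPy). The embeddings and weight matrices below are random stand-ins rather than trained values, and the dimensions are toy sizes chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 8           # 6 tokens in "The cat is in the hat", toy sizes

X = rng.normal(size=(seq_len, d_model))   # word embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_k))     # trainable weight matrices (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: what each token is looking for
K = X @ W_K   # keys: what each token can offer
V = X @ W_V   # values: the content passed along once relevance is known

print(Q.shape, K.shape, V.shape)   # (6, 8) each
```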

To determine how much attention each word should give to other words, we calculate a score between the query of one word and the keys of all the words in the sentence (denoted as Score(X1, X2)).

We take the dot product between a query vector (Q) and a key vector (K) to obtain this score.

Why do we use the dot product? The dot product for two vectors whose direction is similar is large, while the dot product for two vectors pointing in different directions is small.

For keys pointing in roughly the same direction as the query, the scores will be large, while for keys pointing in a different direction from the query vector, the scores will be low.

So in a way, it’s computing a similarity score.

We do this for all words in the sentence. Each word’s query is compared with the keys of all other words. Keep in mind that there are also score values between a word and itself, even though they are not shown here.

We then feed this score into a softmax activation function to normalize it.

After determining the attention scores, we compute, for each word, a weighted sum of the value (V) vectors, where each value is weighted by its attention score.

These weighted sums are the final context-carrying values for the tokens.

We can pass these through additional layers like fully connected neural networks to produce different outputs.
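Putting the steps together, here’s a minimal sketch of the full attention computation: dot-product scores, a softmax to normalize them, and a weighted sum of the value vectors. It also includes the 1/sqrt(d_k) scaling used in the original paper (not covered above), and random toy matrices stand in for the projected queries, keys, and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query with every key
    weights = softmax(scores, axis=-1)     # normalize: each row sums to 1
    return weights @ V, weights            # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8                        # 6 tokens, toy dimensions
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))  # stand-ins for X·W_Q, etc.
context, weights = attention(Q, K, V)
print(weights.shape)    # (6, 6): one attention weight per pair of tokens
print(context.shape)    # (6, 8): one context-carrying vector per token
```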

Transformer Architecture

The transformer makes use of many of these attention blocks to form a model that performs very well in NLP tasks.

The transformer is made up of two large segments: the encoder and the decoder.

Similarly to how autoencoders work, the encoder processes the input data and generates an abstract representation.

  • We have an issue explaining autoencoders here, and it can be helpful in understanding the encoder-decoder architecture of a transformer.

The decoder takes the encoder's output and produces the final result (e.g., translated text).

At the input, embedding and positional encoding are used to embed tokens into numerical vectors and to add information about the position of each token in the sequence.
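As a small sketch, here’s the sinusoidal positional encoding from the original transformer paper: each position gets a fixed pattern of sines and cosines that is added to its token embedding.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1): position index
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2): dimension pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosines
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8): added elementwise to the embedding matrix X
```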

Focusing on some specific parts of the transformer, we can see the attention block being used.

  • The specific type of attention block used is called the multi-head attention, and it works about the same way as a regular attention block—we still use query, key, and value vectors for tokens, allowing the model to focus in on certain words.

“Multi-head” means that the model computes attention multiple times (one self-attention mechanism is called a head) in parallel using different weight matrices.

After the multi-head attention, the result is added to the original input (through a residual connection) and then normalized to stabilize training.
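Here’s a minimal sketch of multi-head attention followed by the add-and-norm step, again with random toy weights: each head runs the same scaled dot-product attention with its own weight matrices, the head outputs are concatenated and projected, and the result is added back to the input and layer-normalized.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, d_k, rng):
    heads = []
    for _ in range(num_heads):
        # Each head gets its own (random, untrained) projection matrices.
        W_Q, W_K, W_V = (rng.normal(size=(X.shape[-1], d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                    # this head's context vectors
    concat = np.concatenate(heads, axis=-1)          # (seq_len, num_heads * d_k)
    W_O = rng.normal(size=(concat.shape[-1], X.shape[-1]))
    return concat @ W_O                              # project back to d_model

def layer_norm(x, eps=1e-5):
    mean, std = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                          # 6 tokens, d_model = 8
out = layer_norm(X + multi_head_attention(X, num_heads=2, d_k=4, rng=rng))
print(out.shape)   # (6, 8): residual connection plus layer normalization
```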

In the transformer architecture, after data passes through the attention block, the output for each token goes through a Positionwise Feed-Forward Network (FFN).

The Positionwise FFN consists of two fully connected layers with an activation function (typically ReLU) in between.

The FFN is called "positionwise" because the transformation is applied independently to each position (each token in the sequence); it does not mix information across different tokens.
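A minimal sketch of the positionwise FFN, with random stand-in weights; notice that each token’s row is transformed independently of the others.

```python
import numpy as np

def positionwise_ffn(X, W1, b1, W2, b2):
    hidden = np.maximum(0, X @ W1 + b1)   # first fully connected layer + ReLU
    return hidden @ W2 + b2               # second fully connected layer, back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                     # toy sizes; the hidden layer is wider
X = rng.normal(size=(6, d_model))         # 6 token vectors from the attention block
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

print(positionwise_ffn(X, W1, b1, W2, b2).shape)   # (6, 8): one output row per token
```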

There’s also another important block to notice in the decoder, and it’s called the masked multi-head attention block.

The masked multi-head attention is similar to the regular multi-head attention, but it ensures that, at each position in the sequence, the model can only access previous positions.

  • This is important for tasks like text generation, where we do not want to see future tokens during training.

This ensures that predictions are generated sequentially, since text has a natural order.
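Here’s a minimal sketch of the causal mask at the heart of masked attention: scores for future positions are set to negative infinity before the softmax, so their attention weights become zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8
Q, K = rng.normal(size=(seq_len, d_k)), rng.normal(size=(seq_len, d_k))  # toy values

scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
scores[mask] = -np.inf                                         # hide future positions

weights = softmax(scores)
print(np.round(weights, 2))   # upper triangle is 0: no attention to future tokens
```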

Multiple layers of these blocks make up the encoder and the decoder. When the data finally reaches the output, it is passed through another fully connected network to obtain a relevant output (e.g., predicting the next word).

We repeat this process until the entire output sequence is generated.

Large Language Models (LLMs) and Generation

You might be wondering how transformers used in LLMs can generate sentences or large paragraphs about many topics.

Essentially, LLMs are autocomplete mechanisms that can automatically predict (or complete) thousands of tokens.

For example, consider a sentence followed by a masked sentence:

  • My dog, Bailey, knows how to perform many tricks.

  • ______________ (masked sentence)

An LLM can generate probabilities for the masked sentence, including:

Probability | Word(s)
3.1%        | For example, he can sit, stay, and roll over.
2.9%        | For example, he knows how to sit, stay, and roll over.

A sufficiently large LLM can generate probabilities for paragraphs or entire essays.

You can think of a user's questions to an LLM as the sentence that is given, followed by an imaginary mask.
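As a rough sketch of this autocomplete loop, the code below generates tokens one at a time. The next_token_probs function is a purely hypothetical stand-in for a real trained transformer (so the output text is meaningless), but the greedy, token-by-token loop is the point.

```python
import numpy as np

def next_token_probs(tokens):
    # Hypothetical stand-in: a real LLM would run the tokens through the transformer
    # and return a probability for every word in its vocabulary.
    vocab = ["he", "can", "sit", "stay", "and", "roll", "over", "."]
    rng = np.random.default_rng(len(tokens))
    p = rng.random(len(vocab))
    return vocab, p / p.sum()

prompt = ["My", "dog", ",", "Bailey", ",", "knows", "many", "tricks", "."]
tokens = list(prompt)
for _ in range(5):                                  # generate five more tokens
    vocab, probs = next_token_probs(tokens)
    tokens.append(vocab[int(np.argmax(probs))])     # greedy: pick the most likely token
print(" ".join(tokens))
```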

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Feedback, inquiries, advertising? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team
