
Transformers and Multi-Head Attention, Mathematically Explained

AI, But Simple Issue #37

Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.

Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!

A transformer is a neural network architecture that has become the foundation for many modern natural language processing (NLP) tasks, such as chatbots, translation, and text generation.

The main innovation behind the transformer is the self-attention mechanism. Attention was originally introduced for encoder–decoder RNNs in machine translation; the transformer relies on it entirely, without any recurrence.

The attention mechanism allows the model to weigh the importance of different tokens (words or symbols) in an input sequence when making predictions.

Before continuing, it is recommended to understand how transformers work conceptually, as this issue is long and includes a worked example with math. We have an issue introducing transformers below:

Math Process

We’re going to be following the original transformer architecture proposed in Vaswani et al.’s 2017 paper “Attention Is All You Need”. In this specific issue, we’ll be covering the encoder portion of the transformer (the left side of the original architecture diagram).

 

Dataset, Vocabulary Size, and Encoding

To begin, for demonstration purposes, we will be using a very small dataset to perform numerical calculations visually. Our entire dataset contains just five sentences, all arbitrarily made up.

Although our dataset is already cleaned, in real-world scenarios, cleaning a dataset requires a significant amount of resources. In addition, our dataset is quite small, whereas the datasets used for many LLMs are extremely large (terabytes of data).

First, we need to determine the vocabulary size (N), which is the total number of unique words in our dataset.

To do this, we need to break our dataset into individual words. After counting all the words and numbering each unique word, we end up with a vocabulary size of N = 22, since there are 22 unique words.

Numbering the unique words is the process of encoding the dataset. Note that two words appear more than once: “is” and “I”. In our process, the same word always receives the same index, so both instances of “is” share the index 7 and both instances of “I” share the index 1.

For our example, we’ll consider every single word as an individual token, although in practice, input sentences are tokenized with tokenizer-specific rules: certain tokenizers treat punctuation as its own token, and a single number can be split into multiple tokens.
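
To make the encoding step concrete, here is a rough Python sketch. The five sentences are illustrative placeholders, not the exact dataset used in this issue.

    # Build a vocabulary by numbering each unique word.
    # The five sentences below are made-up placeholders (not the issue's dataset).
    sentences = [
        "the cat is sleeping",
        "i like the weather",
        "the weather is nice",
        "i drink coffee daily",
        "cats chase small birds",
    ]

    word_to_index = {}                        # unique word -> integer index
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index) + 1   # number new words 1, 2, 3, ...

    vocab_size = len(word_to_index)           # N: number of unique words
    print(word_to_index)
    print("Vocabulary size N =", vocab_size)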

Calculating Embeddings

After encoding our entire dataset, it's time to choose our input. Let’s go with a very simple input sentence from our dataset. We’ll use “The cat is _” and ask the model to predict the next word.

From this prompt, there are many different sentences the model can make and many next words it can predict (some examples include “very interesting”, “sleeping”, or “drinking coffee”).

We need to create an embedding vector for each individual word in our input sentence. The original paper uses a 512-dimensional embedding vector for each word.

For our purposes, we’ll use a 3-dimensional embedding (much smaller) vector to visualize how the calculation is happening:

The values of the embedding vectors are between 0 and 1 and are initialized randomly at the start. They will later be updated as our transformer starts learning the meaning of words.

  • Note that all of our values will be rounded to 3 decimal places to ensure a consistent calculation.
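
As a minimal NumPy sketch, the embedding matrix for our three-word input could be set up like this (the random values here are illustrative, not the ones used in the worked example):

    import numpy as np

    np.random.seed(0)                          # illustrative seed, for reproducibility only
    tokens = ["The", "cat", "is"]              # our input sentence
    d_model = 3                                # embedding dimension (512 in the original paper)

    # One random embedding vector per token, values in [0, 1), rounded to 3 decimals
    embedding_matrix = np.round(np.random.rand(len(tokens), d_model), 3)
    print(embedding_matrix)                    # shape (3, 3): one row per token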

Positional Encoding and Concatenation

After embedding the words in our input, we need to encode position information into our embedding matrix. There are two formulas for positional encoding, depending on whether the dimension index of a word vector is even or odd. If the index is even, we use sine; if it is odd, we use cosine.

We mirror the standard sinusoidal positional encoding from Vaswani et al.’s paper.
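
Concretely, the two formulas are:

    PE(pos, 2i)   = sin(pos / 10000^(2i / d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d))

Here d is the dimension of the embedding vectors (3 in our example, 512 in the original paper).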

Above, PE stands for positional encoding. Its parameters are pos, the position of the token/word in the sentence, and i, which indexes the dimensions of the input embedding matrix.

Our input sentence is "The cat is" and the starting word is "The", with a starting position (pos) of 0 and a total dimension (d) of 3: (d0, d1, d2). We’ll place pos on the columns and the value of i on the rows.

For dimension i from 0 to 2, let’s calculate the positional encoding for our first word of the input sentence: “The”.

Using the same method, we can calculate the positional encoding for all the words in our input sentence.

Above, e-6 means to multiply by 10^(-6); 4.65e-6 is equivalent to 0.00000465.

After calculating the positional encoding of each word, we add the encoding to our input word embeddings to obtain a position-aware word embedding. This helps the model keep track of where each token sits in the sentence.

This resultant matrix from combining both matrices will be used as the input to the encoder portion of the transformer.
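
Continuing the NumPy sketch, the positional encoding and the resulting position-aware embeddings could be computed like this (values are illustrative):

    import numpy as np

    def positional_encoding(num_tokens, d_model):
        """Sinusoidal positional encoding: sine for even dimension indices, cosine for odd ones."""
        pe = np.zeros((num_tokens, d_model))
        for pos in range(num_tokens):
            for k in range(d_model):
                angle = pos / (10000 ** (2 * (k // 2) / d_model))
                pe[pos, k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)
        return pe

    np.random.seed(0)
    d_model = 3
    embedding_matrix = np.round(np.random.rand(3, d_model), 3)   # "The cat is", as before

    pe = positional_encoding(3, d_model)
    position_aware = np.round(embedding_matrix + pe, 3)          # input to the encoder
    print(position_aware)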

Multi-Head Attention

Now, we’ll look at the most crucial part of the transformer—multi-head attention.

Multi-head attention is simply a combined attention block made up of many single-head attentions. The number of heads to use is a design choice, although more heads generally help a model deal with more complex text data.

For instance, GPT-3 uses 96 attention heads per layer, which helps it process long sequences and complex vocabulary.

But before understanding multi-head attention, it’s important to understand how single-head attention is computed. A single head of attention is computed like so:
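
In the notation of the original paper:

    Attention(Q, K, V) = softmax(Q · K^T / sqrt(dk)) · V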

The attention mechanism takes three inputs: queries (Q) and keys (K) of dimension dk, and values (V) of dimension dv. Queries, keys, and values are packed into matrices.

Attention is calculated by performing a matrix multiplication between the query (Q) matrix and the transpose of the key (K) matrix, dividing the product by sqrt(dk), applying a softmax function, and then multiplying the matrix with the value (V) matrix.

Each of the key, query, and value matrices is obtained by multiplying different weight matrices with the transpose of the position-aware word embedding matrix that we computed earlier.

For the weight matrices (K, Q, V), the number of rows must be equal to the number of columns of the transposed position-aware word embedding matrix. This is simply due to matrix multiplication constraints.

The number of columns of the weight matrices is a free design choice; it sets the dimension of the resulting queries, keys, and values. For our example, let’s use 3 columns to make the calculations easier.

Like any other weight matrix, the values within the matrix are chosen randomly, and we’ll use values between 0 and 1. These values will be updated during learning.

The resultant query, key, and value matrices after matrix multiplication are shown below:
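
Since the exact numbers depend on the random initial weights, here is a small NumPy sketch of this step instead. It uses the common row-per-token convention (embeddings multiplied by the weight matrices); with our 3×3 matrices, the shapes work out the same as in the description above.

    import numpy as np

    rng = np.random.default_rng(1)             # illustrative values only
    X = rng.random((3, 3))                     # position-aware embeddings: 3 tokens x 3 dims

    W_q = rng.random((3, 3))                   # query weights, values in [0, 1)
    W_k = rng.random((3, 3))                   # key weights
    W_v = rng.random((3, 3))                   # value weights

    Q = X @ W_q                                # query matrix, one row per token
    K = X @ W_k                                # key matrix
    V = X @ W_v                                # value matrix
    print(Q.shape, K.shape, V.shape)           # (3, 3) each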

Now that we have all three matrices, let's start computing single-head attention.

According to the attention formula, we’ll multiply the query matrix with the transpose of the key matrix using a standard matrix multiplication (matmul in the diagram).

After the multiplication, we scale the resultant matrix. To do this, we divide the matrix by the square root of the key dimension dk, which in our example is 3 (the same as our embedding dimension).

The next step of masking is optional—we won't be calculating it for simplicity.

  • Masking (causal masking) prevents the model from seeing future tokens when making predictions, creating a “left to right” or sequential learning process.

  • This is useful for text generation, where the model must predict the next word without knowing what comes next.

After skipping the masking step, we pass our scaled matrix into the softmax activation function. The softmax function will be applied row-wise to our matrix.

  • Softmax creates a probability distribution of values between 0 and 1 that sum to 1. Its definition can be found below.
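
For a row of scores x = (x1, ..., xn), the softmax of the i-th entry is:

    softmax(x)_i = exp(x_i) / (exp(x_1) + exp(x_2) + ... + exp(x_n))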

After the softmax step, we perform another matrix multiplication of the softmax matrix with the value matrix.
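
Putting the whole single-head calculation together, here is a minimal NumPy sketch (no masking; all values are illustrative):

    import numpy as np

    def softmax(x):
        """Row-wise softmax; subtracting the row max keeps the exponentials numerically stable."""
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def single_head_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)        # matmul, then scale by sqrt(dk)
        weights = softmax(scores)              # each row becomes a probability distribution
        return weights @ V                     # weighted sum of the value vectors

    rng = np.random.default_rng(1)             # same illustrative setup as before
    X = rng.random((3, 3))                     # position-aware embeddings
    W_q, W_k, W_v = (rng.random((3, 3)) for _ in range(3))
    attention_out = single_head_attention(X @ W_q, X @ W_k, X @ W_v)
    print(np.round(attention_out, 3))          # shape (3, 3): one row per token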

Now that we have a matrix that matches the mathematical definition of a single head of attention, what about multi-head attention?

Multi-head attention is made up of many single-head attentions, as mentioned previously. In multi-head attention, once all single-head attentions compute their resultant matrices, they will be concatenated into one big matrix.

In this step, multiple instances of single-head attention run in parallel, each with its own query, key, and value weights, and their outputs are finally combined into one large matrix.

However, whether it's single-head or multi-head attention, the resultant matrix needs to be linearly transformed by being multiplied with a weight matrix. This gets us the final output of the attention mechanism.
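
A minimal NumPy sketch of the full multi-head mechanism (two heads here for brevity; the number of heads and all values are illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def single_head_attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multi_head_attention(X, num_heads=2, d_k=3, seed=2):
        """Run several single-head attentions in parallel, concatenate, then linearly project."""
        rng = np.random.default_rng(seed)
        d_model = X.shape[-1]
        heads = []
        for _ in range(num_heads):
            W_q, W_k, W_v = (rng.random((d_model, d_k)) for _ in range(3))   # per-head weights
            heads.append(single_head_attention(X @ W_q, X @ W_k, X @ W_v))
        concat = np.concatenate(heads, axis=-1)          # one big matrix: (tokens, num_heads * d_k)
        W_o = rng.random((num_heads * d_k, d_model))     # final weight matrix for the linear transform
        return concat @ W_o                              # back to (tokens, d_model)

    X = np.random.default_rng(0).random((3, 3))          # position-aware embeddings
    print(np.round(multi_head_attention(X), 3))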

Adding and Normalizing

The next step involves adding and normalizing the multi-head attention resultant matrix. The addition acts as a residual connection, and the normalization is a layer normalization process, which helps stabilize training and improve gradient flow.

Once we obtain the output from multi-head attention, we have to add it to our original position-aware word embedding matrix.

We also need to normalize the matrix. To do this, we compute the mean and standard deviation of each row.
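
Written out for a single entry x_ij (row i, column j) of the combined matrix, the normalization is:

    norm(x_ij) = (x_ij − μ_i) / (σ_i + ε)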

Above, σ represents the standard deviation and μ represents the mean.

To normalize, we subtract the corresponding row mean from each entry in the matrix and divide by that row’s standard deviation plus a small error constant ε (to prevent the denominator from being zero).
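
A minimal NumPy sketch of the whole add-and-norm step (illustrative values; note that standard library implementations of layer normalization usually add ε to the variance under the square root, while we follow the simplified formula above):

    import numpy as np

    def add_and_norm(x, sublayer_out, eps=1e-6):
        """Residual connection followed by row-wise normalization."""
        added = x + sublayer_out                       # the "add" (residual) step
        mu = added.mean(axis=-1, keepdims=True)        # row-wise mean
        sigma = added.std(axis=-1, keepdims=True)      # row-wise standard deviation
        return (added - mu) / (sigma + eps)            # normalize each row

    rng = np.random.default_rng(3)                     # illustrative values only
    position_aware = rng.random((3, 3))                # position-aware embedding matrix
    attention_out = rng.random((3, 3))                 # multi-head attention output
    print(np.round(add_and_norm(position_aware, attention_out), 3))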

Feed Forward Network

After normalizing the matrix, we pass the matrix through a feedforward neural network. For our example, we will be using a very basic network that contains only one linear (or fully connected) layer and one ReLU activation function.

We calculate the linear layer by multiplying the matrix with a random matrix of weights and adding the product to a bias vector that also contains random values. Both of these parameter matrices, like any other parameters, will be updated through training.

After calculating the linear layer, we pass it through a ReLU activation function. This replaces any negative values from the matrix with zeroes:
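
A NumPy sketch of this simplified feedforward step (one linear layer plus ReLU, with illustrative random parameters; the original paper uses two linear layers with a larger hidden size):

    import numpy as np

    def feed_forward(x, W, b):
        """One linear (fully connected) layer followed by ReLU."""
        linear = x @ W + b                     # multiply by the weight matrix, add the bias vector
        return np.maximum(linear, 0)           # ReLU: negative values become zero

    rng = np.random.default_rng(4)             # illustrative values only
    x = rng.random((3, 3))                     # normalized matrix from the previous step
    W = rng.random((3, 3))                     # random weight matrix
    b = rng.random(3)                          # random bias vector
    print(np.round(feed_forward(x, W, b), 3))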

Adding and Normalizing Again

Once we obtain the resultant matrix from the feedforward network, following the diagram, we add it to the previous “add and norm” matrix, then normalize it using row-wise mean and standard deviation.

The output matrix of this second “add and norm” step is the encoder’s output. In the original transformer diagram, it supplies the key (K) and value (V) matrices for the encoder–decoder attention block in the decoder (a line traces from this add and norm block to that multi-head attention), while the queries come from the decoder itself.

Up to this point, we have mathematically explained the encoder portion of a transformer. Congratulations! In our next issue, we’ll explain the math behind a transformer decoder.

Here’s a special thanks to our biggest supporters:

Sushant Waidande

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Feedback, inquiries, advertising? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team
