
Hello from the AI, but simple team! If you enjoy our content, consider supporting us so we can keep doing what we do.
Our newsletter is no longer sustainable to run at no cost, so we’re relying on different measures to cover operational expenses. Thanks again for reading!
Transformers and Self-Attention, Mathematically Explained (Part 2)
AI, But Simple Issue #38
This week’s issue is a continuation of last week’s topic on transformers, exploring the inner workings and the mathematical computations of the most popular model architecture for NLP tasks. The first part can be found below:
Here’s a quick refresher on transformers: A transformer model is a type of neural network widely used for modern natural language processing (NLP) tasks because of its state-of-the-art performance.
Real-world applications of transformers include chatbots (such as ChatGPT), translation tools, and text generation systems.
The key to why the transformer works so well for language is the self-attention mechanism, which evolved from the attention mechanisms originally created for RNN-based machine translation.
The attention mechanism allows the model to weigh the importance of different tokens (words or symbols) in an input sequence when making predictions.
This weighting lets the model better understand the context of individual words, capturing relationships between words even in longer texts.
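As a reminder, this weighting is computed with the standard scaled dot-product attention formula, where each query is scored against every key and the scores are normalized with a softmax:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

Here Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the keys; the softmax output is the matrix of attention weights applied to V.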
Last week, we went over a simple worked example using a small vocabulary and a 3-word sentence: “the cat is”. In that issue, we covered the computations within the transformer encoder.
This week, we’ll be discussing the transformer decoder, covering all the math and computations.
Again, before continuing, it is recommended to understand how transformers work conceptually, as this issue is long and includes a worked example with math. We have an issue introducing transformers below:
Transformer Decoder
Take a look at our transformer architecture below. We already covered the encoder (left side); now we’re going over the decoder (right side):

In our encoder, every computation we performed was new, from encoding our dataset to passing our matrix through a feedforward network.
The good news is that many of the decoder’s computations are the same ones we’ve already covered in the encoder.
To avoid repetition, we won’t walk through those steps again in full; instead, we’ll highlight the key differences between the encoder and decoder.
We’ll focus on the calculations of the input and output of the decoder, as well as some specific blocks in the center of the decoder.
Decoder Input Streams
The decoder takes two main streams of data as input.
The first comes from the encoder: the output matrix of the encoder’s last “add and norm” layer is multiplied with separate key (K) and value (V) weight matrices to produce the key and value matrices for the second multi-head attention layer in the decoder:

As with the other multi-head attention blocks, a query (Q) matrix is also used alongside the key (K) and value (V) matrices; in this layer, the queries come from the decoder’s own input stream rather than from the encoder.
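To make these two input streams concrete, here is a minimal NumPy sketch of a single cross-attention head. The variable names, dimensions, and random weights are illustrative assumptions rather than the values from our worked example; the point is simply which stream supplies the queries and which supplies the keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (assumptions, not the worked-example values):
# 3 source tokens from the encoder, 2 target tokens in the decoder, d_model = d_k = 4
d_model, d_k = 4, 4
rng = np.random.default_rng(0)

enc_output = rng.standard_normal((3, d_model))  # encoder's final "add and norm" output
dec_hidden = rng.standard_normal((2, d_model))  # decoder stream (after masked self-attention)

# Learned projection weights, randomly initialized here for illustration
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# Cross-attention: queries come from the decoder, keys and values from the encoder
Q = dec_hidden @ W_q   # shape (2, d_k)
K = enc_output @ W_k   # shape (3, d_k)
V = enc_output @ W_v   # shape (3, d_k)

scores = Q @ K.T / np.sqrt(d_k)     # (2, 3): each target token scores every source token
weights = softmax(scores, axis=-1)  # each row sums to 1
context = weights @ V               # (2, d_k): context vectors passed to the next sublayer

print(context.shape)  # (2, 4)
```

In a real transformer the projection weights are learned parameters with one set per head, but the two data streams flow in the same way as in this sketch.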