Recurrent Neural Networks, Explained Mathematically (Supporter Only)

AI, But Simple Issue #22

A Recurrent Neural Network (RNN) is a special type of neural network designed to handle sequential data.

  • Sequential data is any type of data where the order of the data matters, such as words, sentences, or time-series data.

At its core, an RNN is trained to take sequential data as an input, and using its hidden state (which acts like its working memory), it transforms the data into a specific output of sequential data.

Unlike feedforward neural networks, RNNs have connections that form cycles, allowing information to be remembered over time.

We’ve gone over RNNs extensively in previous issues; you can read those issues to learn the theory behind RNNs.

This week, we’re going into the mathematics of simple RNNs, so a basic understanding of how an RNN works is assumed.

Also, for new subscribers or beginners, some familiarity with the mathematical workings of neural networks will help, along with a bit of basic linear algebra (matrix multiplication, vectors, etc.) and calculus.

You can find a math explanation for a simple Multi-Layer Perceptron (MLP), one of the simplest forms of ANNs, here.

Mathematical Formulation

This week, we’re going over a standard RNN for text generation. This type of RNN is fed a sequence and asked to predict the next word at each step, which is how it generates text.

Let’s start with an example using a small phrase: “Hello World”.

We’ll use the vocabulary below for the model. This means that the model only has access to these words during training. We can call these individual words tokens.

  1. <START>

  2. Hello

  3. World

  4. <END>

Here, you might be wondering what tokens 1 and 4 are. RNNs typically use start and end tokens to denote where a sentence starts and where it ends.

Each word is represented as a one-hot vector:
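Assuming the one-hot positions follow the vocabulary order above (index 1 through 4), the four vectors look like this:

$$
\texttt{<START>} = \begin{bmatrix}1\\0\\0\\0\end{bmatrix},\quad
\texttt{Hello} = \begin{bmatrix}0\\1\\0\\0\end{bmatrix},\quad
\texttt{World} = \begin{bmatrix}0\\0\\1\\0\end{bmatrix},\quad
\texttt{<END>} = \begin{bmatrix}0\\0\\0\\1\end{bmatrix}
$$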

To train an RNN, we need to feed it a data point: an input sequence. Using the phrase “Hello World” works a little differently once we add start and end tokens.

We’ll use the start token in the input sequence and the end token in the target sequence so the model can learn what the first word of a sequence typically is, as well as where a sentence should end.

Our input and target sequences look like this:

Input Sequence:

  • Time Step 1: "<START>"

  • Time Step 2: "Hello"

  • Time Step 3: "World"

Target Sequence:

  • Time Step 1: "Hello"

  • Time Step 2: "World"

  • Time Step 3: "<END>"

Let’s talk about target values (values in the target sequence). The target values that the RNN uses act just like labels in other networks. In RNNs, if we’re training a text generator, the target value that the RNN will use will simply be the next token or word in the sentence.

For instance, in our example, if our token is the “<START>” token, we would set the target value to be “Hello”.

This structure of target values, fed to the RNN over and over, teaches the model how to start a sentence (what comes after the start token) and how to end one (where the end token should be placed). This is how RNNs are able to generate sentences word by word.

Model Architecture

Let’s quickly glance over our model architecture:

The weight matrices and biases are included in the diagram; read on for an explanation.

  • Input Size (Matches Vocabulary Size) (N): 4

  • Hidden Size (H): 2

  • Output Size (Matches Vocabulary Size) (C): 4

Our vocabulary size is 4, so we have 4 input neurons, as the input vector has a size of 4.

The hidden size refers to the dimensionality of the hidden state vector ht in the RNN at each time step t. It determines the number of neurons in the hidden layer(s) of the network.

Remember, an RNN’s hidden neurons can be unrolled over time, and they represent the hidden state.

Recall that the total number of time steps in the hidden state matches the length of our input sequence (so in this case, we would have a total of 3 time steps).

The output size will match the vocabulary size, as the outputs will be one-hot encoded vectors of each of the possible words, which is a vector of size 4.

Parameters

Just a quick reminder: whenever we discuss the dimensions of a matrix, the first number is going to be the rows, and the second the columns.

Let’s go over some of the weight matrices and biases used in our RNN:

  • Input-to-Hidden Weights (Wxh; with a size of H×N (2×4))

  • Hidden-to-Hidden Weights (Whh; with a size of H×H (2×2))

  • Hidden-to-Output Weights (Why; with a size of C×H (4×2))

  • Hidden Layer Bias (bh; with a size of H×1 (2×1))

  • Output Layer Bias (by; with a size of C×1 (4×1))

  • Initial Hidden State (h0; with a size of H×1 (2×1))

Keep in mind that these matrices were filled with randomly picked but realistic values just for this example.

  • For instance, the initial hidden state is usually all 0s, and weight matrices are usually initialized to small values.
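As a rough sketch of this setup (a hypothetical NumPy snippet, not the exact values used in this walkthrough), the parameters with the shapes listed above could be initialized like this:

```python
import numpy as np

rng = np.random.default_rng(0)

N, H, C = 4, 2, 4  # input (vocab) size, hidden size, output size

# Weight matrices, initialized to small random values
Wxh = rng.normal(scale=0.1, size=(H, N))  # input-to-hidden, 2x4
Whh = rng.normal(scale=0.1, size=(H, H))  # hidden-to-hidden, 2x2
Why = rng.normal(scale=0.1, size=(C, H))  # hidden-to-output, 4x2

# Biases and the initial hidden state, initialized to zeros
bh = np.zeros((H, 1))  # hidden layer bias, 2x1
by = np.zeros((C, 1))  # output layer bias, 4x1
h0 = np.zeros((H, 1))  # initial hidden state, 2x1
```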

Forward Pass

We start forward propagation by computing the hidden states, outputs, and losses for timesteps 1, 2, and 3.

Timestep 1 (t = 1)

We take our input from the input sequence and our target output from the target sequence for this time step.

Input Vector:

Target Output:

For our activation function from the input to the hidden layer, we'll use the hyperbolic tangent function (Tanh):
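The standard definition of tanh is:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$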

Standard RNNs very often use the hyperbolic tangent function, as other activations such as ReLU can, under certain conditions, produce numbers so large that they lead to overflow.

Properly placed Tanh functions can avoid this. In the worst case, using Tanh activations will produce saturation and vanishing gradients, which is better than overflow.

  • If you want to learn more about vanishing gradients and common deep learning problems, feel free to read this issue.

Tanh is used to squash all values to between -1 and 1. Values can still increase or decrease, but their magnitudes stay small.

Tanh also tends to be a little better than sigmoid at dealing with the vanishing gradient problem.

Let’s calculate the hidden state using this formula:
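In symbols, with xt the input at time t, ht-1 the previous hidden state, and bh the hidden layer bias:

$$h_t = \tanh\left(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\right)$$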

The hidden state is calculated by multiplying the input-to-hidden weight matrix (Wxh) with the input (xt), adding the hidden-to-hidden weight matrix (Whh) multiplied by the previous hidden state (ht-1), adding the hidden layer bias (bh), and finally passing the whole sum through an activation function like tanh.

Let’s calculate the hidden state at our current timestep, h1:

Wxh multiplied with x1 results in a matrix of size 2×1, since Wxh has a size of 2×4, and x1 has a size of 4×1 (standard matrix multiplication).

The bias has the same size and can directly be added. We pass this matrix through the tanh function to obtain our hidden state matrix at t = 1.

Since h0 is a vector of zeros (with size 2×1), the Whh·h0 term has no effect on the sum, which is why we can drop it.
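As a quick shape check for this timestep (with the h0 term dropped):

$$h_1 = \tanh\Big(\underbrace{W_{xh}}_{2\times 4}\,\underbrace{x_1}_{4\times 1} + \underbrace{b_h}_{2\times 1}\Big) \in \mathbb{R}^{2\times 1}$$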

Then, we’ll calculate the output at this timestep:
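This is a linear step using the hidden-to-output weights and the output bias:

$$y_t = W_{hy}\,h_t + b_y$$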

Let’s break this process into parts. The weight matrix multiplied with the hidden state is calculated as follows:

We can then add this result to the bias to obtain our output at timestep 1.

After computing our output vector, we can go ahead and pass it through the softmax activation function, creating a probability distribution for the values.

As a quick refresher, the softmax function creates a distribution with all values adding up to 1, and it can be visualized like this:

The probability of the output being a word in our vocabulary with index i is calculated like so:
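With yt,i denoting the i-th entry of the output vector at time t, the softmax probability is:

$$p_{t,i} = \frac{e^{\,y_{t,i}}}{\sum_{j=1}^{C} e^{\,y_{t,j}}}$$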

We end up calculating probabilities for the 4 possible words, where the numerator is the exponential of the corresponding entry of the output vector, and the denominator is the sum of the exponentials of all entries of the output vector.

Let’s calculate the probabilities of predicting the next token as token 1, 2, 3, or 4.

Each probability has two subscripts: the first is the timestep, and the second is the word’s index in the vocabulary. For instance, p1,3 is the probability of word number 3 being predicted at timestep 1, i.e., right after the “<START>” token.

At this point, the RNN’s most probable prediction for the word after “<START>” is “<END>”. Although this isn’t the result we’re looking for, we’ll continue to train the model to output sentences that make more sense.

After passing our data through the softmax function, we can calculate the loss for this training example at time step 1.

In our case, as we’re doing multiclass classification, a common loss function to use is cross-entropy loss (used for classification tasks where the output is a probability distribution over classes).

C is the number of classes, which matches our output size. yt,i^target is the target output for class i at time t (usually one-hot encoded), and pt,i is the predicted probability for class i at time t, obtained after applying softmax to the output yt.

In simple terms, the loss at timestep t is equal to the negative sum of all of the target values multiplied by the log of the predicted values:
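Writing Lt for the loss at timestep t:

$$L_t = -\sum_{i=1}^{C} y^{\text{target}}_{t,i}\,\log\left(p_{t,i}\right)$$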

Timestep 2 (t = 2)

For time step 2, we complete a similar process:

Input Vector:

Target Value:

Let’s calculate the hidden state for t = 2:

Then, we can go and calculate the output matrix at t = 2:

Calculate Why multiplied with h2:

Then add it to the bias (by) to obtain the output:

We then pass the output through a softmax function to obtain the probabilities:

Here, the model believes that the most likely word after “Hello” is “<END>”. Keep in mind that this is still one singular training example; RNNs will get fed much more data.

We get the loss through the same method mentioned previously:

This time, our loss is a little bit less than last time.

Timestep 3 (t = 3)

We go through the same process as timesteps 1 and 2.

Input Vector:

Target Value:

We calculate the hidden state at t = 3, like so:

The output matrix at t = 3 is calculated the same way as the other output matrices are calculated:

We’ll break it into parts, then compute them together:

After obtaining the full output matrix, we can compute the probabilities by passing it through the softmax function.

Again, the most probable prediction is the token “<END>”, but this time it comes after the word “World”, which matches the target for this timestep.

After computing these probabilities, we compute the loss at t = 3:

At this timestep, the loss is lower than both the first and second losses.

Let’s compute the total amount of loss for this entire training example over all timesteps:
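Summing the three per-timestep losses gives:

$$L = \sum_{t=1}^{3} L_t = L_1 + L_2 + L_3$$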

After timestep 3, we’re finished with the forward pass of the training example. We’ll go over the backwards pass, including Backpropagation Through Time (BPTT), in another issue.
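To tie the forward pass together, here is a minimal NumPy sketch of the computation we just walked through, assuming the parameter shapes and names from the setup sketch above (the helper names softmax and forward_pass are illustrative, and random weights will not reproduce the exact numbers in this example):

```python
import numpy as np

def softmax(y):
    """Turn an output vector into a probability distribution."""
    e = np.exp(y - np.max(y))  # subtracting the max improves numerical stability
    return e / e.sum()

def forward_pass(inputs, targets, Wxh, Whh, Why, bh, by, h0):
    """Run the forward pass over a sequence of one-hot column vectors,
    returning hidden states, probabilities, and the total loss."""
    h = h0
    hidden_states, probs, total_loss = [], [], 0.0
    for x, y_target in zip(inputs, targets):
        # Hidden state: h_t = tanh(Wxh x_t + Whh h_{t-1} + b_h)
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        # Pre-softmax output: y_t = Why h_t + b_y
        y = Why @ h + by
        # Softmax probabilities over the vocabulary
        p = softmax(y)
        # Cross-entropy loss at this timestep: -sum(y_target * log(p))
        total_loss += -np.sum(y_target * np.log(p))
        hidden_states.append(h)
        probs.append(p)
    return hidden_states, probs, total_loss
```

Calling forward_pass with the three one-hot input vectors for “<START>”, “Hello”, and “World” and the one-hot targets for “Hello”, “World”, and “<END>” carries out exactly the per-timestep computations above: hidden state, output, softmax, and cross-entropy loss, summed into the total loss.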

If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!

Feedback, inquiries, advertising? Send us an email at [email protected].

If you like, you can also donate to our team to push out better newsletters every week!

That’s it for this week’s issue of AI, but simple. See you next week!

—AI, but simple team