Recurrent Neural Networks, Revisited
AI, But Simple Issue #15
One of the big problems with the original RNN design shows up when long sequences are involved: RNNs struggle to maintain information over many time steps due to vanishing and exploding gradients.
As RNNs shifted more into language processing, we increasingly needed models that could handle longer sequences, since text data often runs to thousands of characters.
Due to this inability to maintain information over long sequences, RNNs tend to “forget” information from earlier parts of a sequence, even if it is crucial.
If you don’t know much about RNNs and want to learn more about them before continuing (time steps, hidden state, etc.), you can check out our issue about them here.
To handle these longer sequences, researchers designed a new, more complex architecture: the LSTM.
Long Short-Term Memory (LSTM)
Standard RNNs do well with short sequences, but they often struggle to maintain information over long ones due to vanishing and exploding gradients during backpropagation.
This causes the RNN to “lose” important information from earlier parts of the sequence, which makes it difficult to deal with long-range data.
One of the first and most successful techniques for addressing vanishing gradients came in the form of the Long Short-Term Memory (LSTM) model in 1997.
LSTMs resemble standard recurrent neural networks, but here each ordinary recurrent node is replaced by a memory cell (also called an LSTM cell).
Each memory cell contains an internal state, a node with a self-connected recurrent edge of fixed weight 1, ensuring that the gradient can pass across many time steps without vanishing or exploding.
The term “Long Short-Term Memory” comes from the idea that recurrent neural networks have long-term memory in the form of weights, which encode general knowledge of the data.
They also have short-term memory in the form of activations, which pass from each node to successive nodes.
Back to the LSTM/memory cell.
A memory cell is a composite unit, built from simpler components called gates in a specific pattern.
The operations between these gates use the element-wise (Hadamard) product, so the gates are multiplicative.
Specifically, LSTM has three key gates:
The input gate, which controls how much of the current input is stored in the internal state;
The forget gate, which decides how much of the previous cell state is retained (whether the internal state should be flushed out);
The output gate, which determines how much of the internal state is passed to the output at each time step.
These gates regulate the flow of information, allowing LSTM networks to maintain and update a memory cell over time.
The data feeding into the LSTM gates are the input at the current time step (the sequential data; for instance, a word in a sequence) and the hidden state from the previous time step (a function of the previous hidden state and input, which serves as the memory of the RNN, explained in our last issue on RNNs).
Three fully-connected layers with sigmoid activation functions compute the values of the input, forget, and output gates.
As a result of the sigmoid activation, all values of the three gates are in the range of (0,1).
Additionally, the cell houses an input node, typically computed with a tanh activation function.
The input and forget gates give the model the ability to decide when to keep the current cell state unchanged and when to modify it in response to specific inputs.
This is the portion that really helps address the vanishing gradient problem.
Finally, we compute the output, or hidden state, of the memory cell: we apply tanh to the internal state, then take a Hadamard product with the output gate's value.
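To make the whole update concrete, here is a minimal NumPy sketch of a single LSTM time step. The function and parameter names (`lstm_step`, `W_i`, `W_f`, `W_o`, `W_g`, and so on) are our own illustrative choices rather than any particular library's API, and the weights are random just to show the shapes involved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a single example.
    x_t: input at the current time step, shape (d,)
    h_prev, c_prev: previous hidden and internal (cell) state, shape (h,)
    params: weight matrices of shape (h, d + h) and biases of shape (h,)
    """
    # Every gate is a fully-connected layer over the current input
    # and the previous hidden state, so concatenate them once.
    z = np.concatenate([x_t, h_prev])

    i_t = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate: how much new info to write
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate: how much old state to keep
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate: how much state to expose
    g_t = np.tanh(params["W_g"] @ z + params["b_g"])   # input node: candidate values

    # Internal state update uses element-wise (Hadamard) products;
    # when f_t is near 1 and i_t near 0, the old state is carried forward unchanged.
    c_t = f_t * c_prev + i_t * g_t

    # Hidden state: tanh of the internal state, gated by the output gate.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny usage example with random weights, just to show the shapes.
rng = np.random.default_rng(0)
d, h = 4, 3
params = {f"W_{k}": 0.1 * rng.normal(size=(h, d + h)) for k in "ifog"}
params.update({f"b_{k}": np.zeros(h) for k in "ifog"})
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), params)
```

Note how the internal state update is just a gated sum: this is the path that lets gradients flow across many time steps when the forget gate stays near 1.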
From a memory cell, we can go on to design many useful language processing models.
Note that, just like RNNs, full LSTM networks are built from more than just the memory cells/LSTM blocks shown above.
Gated Recurrent Units (GRU)
LSTMs were great, but researchers were on the lookout for new ways to speed up the computation process.
The Gated Recurrent Unit (GRU) offered a simpler and more efficient version of the LSTM memory cell that often achieves comparable performance but with the advantage of being faster to compute.
Here, the LSTM’s three gates are replaced by two: the reset gate and the update gate. Similar to LSTMs, these gates use sigmoid activations, forcing their values into the range (0, 1).
Basically, these are two vectors that decide what information should be passed to the output. They can be trained to keep information from long ago just like the LSTM, making it effective for long sequences.
The reset gate controls how much of the previous state we might still want to remember (like a forget gate).
An update gate helps the model determine how much of the past information from previous time steps needs to be passed along to the future.
This is really powerful because the model can decide to pass nearly all of the past information forward, which helps mitigate the vanishing gradient problem.
The image below illustrates the inputs for both the reset and update gates in a GRU, given the input at the current time step and the hidden state of the previous time step.
Note that the gates are computed similarly to the LSTM's: fully-connected layers passed through a sigmoid activation function.
Next, we integrate the reset gate with the previous hidden state and the input, leading to the following candidate hidden state.
It is a candidate since we also need to incorporate it with the update gate, as shown in the diagram above.
Then, the next hidden state is computed by blending the previous hidden state with the candidate hidden state, weighted by the update gate.
That’s pretty much how the Gated Recurrent Unit works, and it can be organized with other layers just like the LSTM cell. A sketch of the full step follows below.
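Here is a matching NumPy sketch of one GRU time step, following the same conventions as the LSTM sketch above. Again, the names (`gru_step`, `W_r`, `W_z`, `W_h`) are illustrative assumptions, and the final blend follows the common convention where update-gate entries near 1 carry the old state forward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step for a single example.
    x_t: input at the current time step, shape (d,)
    h_prev: previous hidden state, shape (h,)
    params: weight matrices of shape (h, d + h) and biases of shape (h,)
    """
    z_in = np.concatenate([x_t, h_prev])

    r_t = sigmoid(params["W_r"] @ z_in + params["b_r"])  # reset gate
    z_t = sigmoid(params["W_z"] @ z_in + params["b_z"])  # update gate

    # Candidate hidden state: the reset gate scales the previous hidden state
    # before it is combined with the current input (reset near 0 -> the candidate
    # depends almost only on x_t; reset near 1 -> close to a vanilla RNN update).
    h_cand = np.tanh(params["W_h"] @ np.concatenate([x_t, r_t * h_prev]) + params["b_h"])

    # Final hidden state: the update gate blends old state and candidate;
    # entries of z_t near 1 carry the old state forward almost unchanged.
    return z_t * h_prev + (1.0 - z_t) * h_cand

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d, h = 4, 3
params = {f"W_{k}": 0.1 * rng.normal(size=(h, d + h)) for k in "rzh"}
params.update({f"b_{k}": np.zeros(h) for k in "rzh"})
h_t = gru_step(rng.normal(size=d), np.zeros(h), params)
```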
An interesting thing to know about the reset gate is that whenever the entries in it are close to 1, we get close to a standard RNN.
Also, for all entries of the reset gate that are close to 0, the candidate hidden state becomes the output of a feedforward ANN applied to the current input alone.
In summary, GRUs have two key features:
Reset gates help capture short-term dependencies in sequences.
Update gates help capture long-term dependencies in sequences.
Having two gates instead of three and a simpler structure allows for faster computation than the LSTM, a welcome upgrade given the 17 years between the two architectures.
If you enjoy our content, consider supporting us so we can keep doing what we do. Please share this with a friend!
If you like, you can also donate to our team to push out better newsletters every week!
That’s it for this week’s issue of AI, but simple. See you next week!
—AI, but simple team