Introduction: Neural Networks for Sequences
Recurrent Neural Networks (RNNs) are designed to process sequential data: text, time series, audio, action sequences. Unlike feedforward networks that process independent inputs, RNNs maintain a hidden state that acts as memory, allowing the network to consider the context of previous information in the sequence.
However, classic RNNs suffer from the vanishing gradient problem: during training, the gradient attenuates rapidly across time steps, making long-term dependencies extremely difficult to learn. LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units) mitigate this problem with gating mechanisms.
What You Will Learn
- How RNNs maintain state across sequences
- The vanishing gradient problem and why it limits RNNs
- LSTM: input gate, forget gate, output gate, and cell state
- GRU: a lightweight alternative to LSTM
- Bidirectional RNNs and sequence-to-sequence models
- Practical implementation: text generation and sentiment analysis
RNN: Architecture and Hidden State
An RNN processes a sequence one element at a time, updating its hidden state at each step. This state vector captures a compressed summary of all information seen up to that point. The output at each time step depends on both the current input and the previous hidden state.
Formally, at each time step t the RNN computes:
- h_t = tanh(W_hh * h_(t-1) + W_xh * x_t + b_h): new hidden state combining previous state and current input
- y_t = W_hy * h_t + b_y: output at the current time step
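To make these formulas concrete, here is a minimal single-step implementation of the recurrence (the weight shapes and random initialization are illustrative assumptions, not tuned values):

```python
import torch

# Illustrative sizes: 10 input features, 64 hidden units, 2 outputs
input_size, hidden_size, output_size = 10, 64, 2
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
W_hy = torch.randn(output_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

def rnn_step(h_prev, x_t):
    # h_t = tanh(W_hh * h_(t-1) + W_xh * x_t + b_h)
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    # y_t = W_hy * h_t + b_y
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Run the recurrence over a sequence of 20 random time steps
h = torch.zeros(hidden_size)
for x_t in torch.randn(20, input_size):
    h, y = rnn_step(h, x_t)
print(h.shape, y.shape)  # torch.Size([64]) torch.Size([2])
```

Note that the same weight matrices are reused at every time step; this weight sharing is what makes an RNN a "recurrent" network rather than a very deep feedforward one.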
import torch
import torch.nn as nn

# Simple RNN in PyTorch
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        # h0 shape: (1, batch, hidden_size)
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)
        output, hidden = self.rnn(x, h0)
        # Use the last hidden state for classification
        out = self.fc(hidden.squeeze(0))
        return out

# Example: sequence of 20 time steps with 10 features each
model = SimpleRNN(input_size=10, hidden_size=64, output_size=2)
x = torch.randn(8, 20, 10)  # batch=8, seq=20, features=10
output = model(x)
print(f"Output: {output.shape}")  # [8, 2]
The Vanishing Gradient Problem
The vanishing gradient is the Achilles heel of classic RNNs. During backpropagation through time (BPTT), the gradient is repeatedly multiplied by the recurrent weight matrix W_hh at each time step. If the largest singular value of this matrix is less than 1, the gradient shrinks exponentially with sequence length; if it is greater than 1, the gradient explodes.
In practice, this means a classic RNN cannot learn dependencies that span more than 10-20 time steps. If the key word to understand the sentiment of a sentence is at the beginning and the output is at the end, the gradient will have nearly vanished before reaching that word.
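The effect is easy to observe empirically. The following sketch backpropagates from only the last output of a 100-step vanilla RNN and inspects how much gradient reaches each input position (sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=64, batch_first=True)
x = torch.randn(1, 100, 10, requires_grad=True)  # 100 time steps
output, _ = rnn(x)

# Backpropagate only from the LAST time step's output
output[:, -1, :].sum().backward()

# Gradient magnitude reaching each input time step
grad_norms = x.grad.norm(dim=-1).squeeze(0)
print(f"grad at t=99 (last): {grad_norms[99]:.2e}")
print(f"grad at t=50:        {grad_norms[50]:.2e}")
print(f"grad at t=0 (first): {grad_norms[0]:.2e}")
```

With the default initialization, the gradient at the first time steps is typically orders of magnitude smaller than at the last ones, which is exactly why early inputs barely influence learning.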
Why LSTMs Are Needed
LSTMs solve vanishing gradient with an elegant insight: instead of forcing all information through repeated multiplications, they add a separate cell state that serves as an information "highway". Gates control which information to add, forget, or read from the cell state, allowing the gradient to flow unchanged across hundreds of time steps.
LSTM: Long Short-Term Memory
LSTMs, introduced by Hochreiter and Schmidhuber in 1997, solve vanishing gradient with four key components:
The Three Gates
- Forget Gate (f_t): decides which information to discard from the cell state. A sigmoid value between 0 (forget everything) and 1 (keep everything) for each dimension
- Input Gate (i_t): decides which new information to add to the cell state. Combines a sigmoid gate (how much to add) with a tanh candidate vector (what to add)
- Output Gate (o_t): decides which part of the cell state to use as output/hidden state. Filters the cell state through tanh and sigmoid
Cell State
The cell state is the heart of the LSTM. It flows through the temporal chain with only linear operations (multiplication and addition), allowing the gradient to propagate easily. The gates regulate the flow of information into and out of the cell state.
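The gate equations can be written out explicitly in a few lines. This is a didactic sketch of a single LSTM step (nn.LSTM fuses all of this into one efficient kernel; the shapes and random weights here are illustrative):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the four gates stacked: [input, forget, cell, output]
    gates = x_t @ W.T + h_prev @ U.T + b
    i, f, g, o = gates.chunk(4, dim=-1)
    i = torch.sigmoid(i)      # input gate: how much new info to write
    f = torch.sigmoid(f)      # forget gate: how much old cell state to keep
    g = torch.tanh(g)         # candidate values to write
    o = torch.sigmoid(o)      # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g  # cell state update: only * and +, gradients flow
    h_t = o * torch.tanh(c_t) # hidden state read out of the cell
    return h_t, c_t

input_size, hidden = 5, 8
W = torch.randn(4 * hidden, input_size) * 0.1
U = torch.randn(4 * hidden, hidden) * 0.1
b = torch.zeros(4 * hidden)
h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
h, c = lstm_step(torch.randn(1, input_size), h, c, W, U, b)
print(h.shape, c.shape)  # torch.Size([1, 8]) torch.Size([1, 8])
```

The key line is the cell state update: because c_t is built from element-wise multiplication and addition only, the gradient can pass through many steps without being squashed by repeated matrix multiplications.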
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """LSTM for sequence classification"""
    def __init__(self, vocab_size, embed_dim, hidden_size,
                 num_layers, num_classes, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        # Bidirectional: hidden_size * 2
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # Concatenate forward and backward hidden states
        hidden_cat = torch.cat((hidden[-2], hidden[-1]), dim=1)
        output = self.fc(self.dropout(hidden_cat))
        return output

# Sentiment analysis: vocab 10000, embedding 128, hidden 256
model = LSTMClassifier(
    vocab_size=10000,
    embed_dim=128,
    hidden_size=256,
    num_layers=2,
    num_classes=2  # positive/negative
)

# Input: batch of 16 sentences, max 50 tokens
x = torch.randint(0, 10000, (16, 50))
output = model(x)
print(f"Output: {output.shape}")  # [16, 2]
GRU: A Lighter Alternative
GRUs (Gated Recurrent Units), introduced by Cho et al. in 2014, are a simplified version of LSTMs. They combine the forget and input gates into a single update gate and merge cell state with hidden state, reducing the number of parameters by approximately 25%.
GRUs have two gates:
- Reset Gate (r_t): how much of the old hidden state to ignore when computing the new candidate
- Update Gate (z_t): how much of the old hidden state to keep vs how much of the new candidate to use
In practice, GRUs achieve comparable performance to LSTMs on many tasks with shorter training times. The choice depends on the task: for very long sequences LSTMs tend to be superior, for smaller datasets GRUs may be preferable due to their lower tendency to overfit.
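In PyTorch, swapping an LSTM for a GRU is a one-line change, and the parameter savings are easy to verify:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)

# GRU has 3 gate matrices vs the LSTM's 4, so ~25% fewer parameters
n_lstm = sum(p.numel() for p in lstm.parameters())
n_gru = sum(p.numel() for p in gru.parameters())
print(f"LSTM params: {n_lstm}, GRU params: {n_gru}")
print(f"GRU is {100 * (1 - n_gru / n_lstm):.0f}% smaller")  # ~25%

x = torch.randn(8, 50, 128)
out, h = gru(x)  # GRU returns only a hidden state (no separate cell state)
print(out.shape, h.shape)  # torch.Size([8, 50, 256]) torch.Size([1, 8, 256])
```

Note the interface difference: nn.LSTM returns a (hidden, cell) tuple, while nn.GRU returns just the hidden state, since the two are merged.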
Bidirectional RNNs and Sequence-to-Sequence
Bidirectional
A bidirectional RNN processes the sequence both forward (left to right) and backward (right to left), concatenating the two hidden states. This allows each position to have context from both the past and the future, which is fundamental for tasks like Named Entity Recognition where a word's meaning depends on the full context.
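A quick shape check illustrates the effect of bidirectional=True (sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# bidirectional=True runs two LSTMs (forward and backward) over the sequence
# and concatenates their outputs, doubling the feature dimension per position
birnn = nn.LSTM(input_size=32, hidden_size=64,
                batch_first=True, bidirectional=True)
x = torch.randn(4, 10, 32)  # batch=4, seq=10, features=32
out, (h, c) = birnn(x)
print(out.shape)  # torch.Size([4, 10, 128]) -> hidden_size * 2 per position
print(h.shape)    # torch.Size([2, 4, 64])  -> one final state per direction
```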
Sequence-to-Sequence (Seq2Seq)
The Seq2Seq architecture uses an encoder RNN to compress the input sequence into a fixed-size context vector, and a decoder RNN to generate the output sequence. This architecture was fundamental for machine translation before the advent of Transformers.
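A minimal encoder-decoder sketch, assuming hypothetical vocabulary sizes and teacher forcing (the decoder receives the ground-truth target as input), without attention:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch (illustrative sizes, no attention)."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder compresses the whole source into one context vector
        _, context = self.encoder(self.src_embed(src))
        # Decoder starts from that context and generates the target
        dec_out, _ = self.decoder(self.tgt_embed(tgt), context)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (4, 12))  # source: 12 tokens
tgt = torch.randint(0, 1200, (4, 15))  # target: 15 tokens
logits = model(src, tgt)
print(logits.shape)  # torch.Size([4, 15, 1200])
```

The single `context` tensor passed from encoder to decoder is exactly the fixed-size bottleneck discussed below in the section on attention.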
class TextGenerator(nn.Module):
    """Character-by-character text generator"""
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Inter-layer dropout only applies with more than one layer
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers,
                            batch_first=True,
                            dropout=0.2 if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        embedded = self.embedding(x)
        output, hidden = self.lstm(embedded, hidden)
        logits = self.fc(output)
        return logits, hidden

    def generate(self, start_token, max_len=100, temperature=0.8):
        """Generate text auto-regressively"""
        self.eval()
        current = start_token.unsqueeze(0).unsqueeze(0)
        hidden = None
        generated = [start_token.item()]
        with torch.no_grad():
            for _ in range(max_len):
                logits, hidden = self(current, hidden)
                # Sample from the distribution over the last position
                logits = logits[:, -1, :] / temperature
                probs = torch.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, 1)
                generated.append(next_token.item())
                current = next_token
        return generated
The Advent of Attention
The main limitation of the Seq2Seq model is the bottleneck: all information from the input sequence is compressed into a single fixed vector. For long sequences, this vector fails to capture all the details. The Attention mechanism, introduced by Bahdanau et al. in 2014, solves this problem by allowing the decoder to "look" directly at all encoder positions. This idea led to Transformers, which we will explore in the next article.
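The core of the idea fits in a few lines. This sketch uses simple dot-product scoring for brevity (Bahdanau's original formulation scores with a small feedforward network), with arbitrary sizes:

```python
import torch
import torch.nn.functional as F

# Instead of one fixed context vector, the decoder scores every encoder
# position and takes a weighted average at each decoding step.
encoder_states = torch.randn(1, 12, 128)  # (batch, src_len, hidden)
decoder_state = torch.randn(1, 128)       # current decoder hidden state

scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
weights = F.softmax(scores, dim=-1)       # one weight per source position
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
print(weights.shape, context.shape)  # torch.Size([1, 12]) torch.Size([1, 128])
```

Because the weights are recomputed at every decoder step, each output token can focus on a different part of the input, removing the fixed-vector bottleneck.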
Practical Applications
RNNs and LSTMs find applications in numerous domains:
- Sentiment Analysis: classifying the sentiment of reviews, tweets, comments. Bidirectional LSTMs capture complete context
- Time Series Forecasting: predicting stock prices, energy consumption, system metrics. LSTMs excel at capturing seasonal patterns
- Text Generation: generating text character by character or word by word, from chatbots to computational poetry
- Machine Translation: automatic translation with Seq2Seq + Attention architecture (predecessor of Transformers)
- Speech Recognition: audio-to-text conversion, where acoustic sequences are mapped to phoneme and word sequences
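As a small taste of the forecasting use case, a next-step predictor over a univariate series might look like this (the architecture and sizes are illustrative, not a tuned model):

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    """Sketch: predict the next value of a series from a sliding window."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):               # x: (batch, window, 1)
        _, (h, _) = self.lstm(x)
        return self.head(h.squeeze(0))  # (batch, 1): next-step prediction

model = Forecaster()
# A 30-step window of a sine wave as a toy input series
window = torch.sin(torch.linspace(0, 6.28, 30)).reshape(1, 30, 1)
pred = model(window)
print(pred.shape)  # torch.Size([1, 1])
```

In a real pipeline one would train this with a regression loss (e.g. MSE) over many windows sliced from the historical series.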
Next Steps in the Series
- In the next article we will explore Transformers, the architecture that made RNNs obsolete for most NLP tasks
- We will cover self-attention, multi-head attention, and positional encoding
- We will analyze BERT and GPT: how they revolutionized Natural Language Processing