Introduction: The Architecture That Changed Everything
The Transformer, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized deep learning by completely eliminating the recurrence of RNNs. In its place, the self-attention mechanism allows every element of the sequence to directly "look at" all other elements, capturing long-range dependencies without the vanishing gradient bottleneck.
Since its introduction, the Transformer has become the dominant architecture not only in NLP (BERT, GPT, T5) but also in computer vision (Vision Transformer), audio (Whisper), and image generation (DALL-E, Stable Diffusion). Understanding this architecture is essential for anyone working in modern deep learning.
What You Will Learn
- Self-Attention: how each token "looks at" others in the sequence
- Query, Key, Value: the mechanics of attention
- Multi-Head Attention: capturing different patterns simultaneously
- Positional Encoding: how the Transformer knows sequence order
- Complete Encoder-Decoder architecture
- BERT vs GPT: encoder-only vs decoder-only
- Practical implementation with Hugging Face Transformers
Self-Attention: The Heart of the Transformer
Self-attention (or intra-attention) allows every position in the sequence to compute an attention weight with respect to all other positions. This means that to understand the word "bank" in a sentence, the model can directly look at whether the context contains words like "river" (river bank) or "money" (financial bank).
The mechanism is based on three vectors computed for each token:
- Query (Q): represents "what am I looking for" - the question each token asks of others
- Key (K): represents "what do I offer" - the label with which each token presents itself
- Value (V): represents "my content" - the actual information to transmit
The attention score between two tokens is the dot product between the Query of the first and the Key of the second, divided by the square root of the key dimension (√d_k) to keep the scores in a numerically stable range. After a softmax, these scores become weights that combine the Values into the output.
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    """Scaled Dot-Product Attention: softmax(QK^T / sqrt(d_k)) V"""
    def __init__(self, d_k):
        super().__init__()
        self.scale = math.sqrt(d_k)

    def forward(self, query, key, value, mask=None):
        # query, key, value: (batch, seq_len, d_k)
        scores = torch.matmul(query, key.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, value)
        return output, attention_weights

# Example
batch_size, seq_len, d_model = 2, 10, 64
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)

attention = ScaledDotProductAttention(d_k=d_model)
output, weights = attention(Q, K, V)
print(f"Output: {output.shape}")   # [2, 10, 64]
print(f"Weights: {weights.shape}") # [2, 10, 10]
Multi-Head Attention
A single attention mechanism captures only one type of relationship between tokens. Multi-Head Attention performs attention in parallel with different linear projections (heads), allowing the model to simultaneously capture syntactic, semantic, positional, and coreference relationships.
Each head operates on a sub-dimension of the embedding space: if d_model=512 and we have 8 heads, each head works on d_k=64 dimensions. The results are concatenated and projected through a final linear transformation.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Linear projections, then split into heads: (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed for all heads at once
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)
        # Concatenate heads and apply the final projection
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(context)

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(4, 20, 512)  # batch=4, seq=20, dim=512
output = mha(x, x, x)
print(f"MHA output: {output.shape}")  # [4, 20, 512]
Positional Encoding
Unlike RNNs that process tokens sequentially, the Transformer processes all tokens in parallel. Without positional information, the model treats the sequence as an unordered set. Positional encoding adds position information to each token's embedding.
The original paper uses sinusoidal functions of different frequencies across the embedding dimensions. The authors hypothesized that this scheme may generalize to sequences longer than those seen during training, because for any fixed offset k, the encoding at position pos+k can be expressed as a linear function of the encoding at pos, which lets the model learn to attend by relative position.
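The sinusoidal scheme can be implemented in a few lines. Below is a minimal sketch (the function name `sinusoidal_positional_encoding` is ours, not from any library):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    # Frequencies decay geometrically across the even dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=64)
print(pe.shape)  # torch.Size([100, 64])
```

The resulting matrix is simply added to the token embeddings before the first encoder layer; since it is fixed, it adds no trainable parameters.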
Why the Transformer Is Better Than RNNs
Three key advantages: (1) Parallelization - all tokens are processed simultaneously, fully leveraging GPUs. (2) Long-range dependencies - every token can directly "look at" any other token, without sequential propagation. (3) Scalability - the architecture scales efficiently to billions of parameters (GPT-3: 175B; GPT-4: roughly 1.8T by unofficial estimates), which is impractical with RNNs because their sequential training cannot be parallelized across time steps.
Encoder-Decoder Architecture
The original Transformer has an encoder-decoder architecture:
- Encoder: 6 identical layers, each combining Multi-Head Self-Attention and a position-wise Feed-Forward Network, with a residual connection and Layer Normalization around each sub-layer. It processes the entire input sequence at once
- Decoder: 6 identical layers with Masked Self-Attention (which prevents attending to future tokens), Cross-Attention over the encoder output, and a Feed-Forward Network. It generates the output autoregressively, one token at a time
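The masking in the decoder's self-attention is just a lower-triangular matrix applied to the attention scores before the softmax. A minimal illustration (the helper name `causal_mask` is ours; this is not a full decoder layer):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])

# Applied to attention scores: future positions become -inf before the softmax,
# so they receive exactly zero attention weight
scores = torch.randn(4, 4)
masked = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked, dim=-1)
print(weights[0])  # the first token can only attend to itself: [1., 0., 0., 0.]
```

This same mask can be passed to the `mask` argument of the attention modules shown earlier.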
BERT vs GPT: Two Philosophies
BERT (Encoder-Only)
BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder. During pre-training, it randomly masks 15% of tokens and predicts them (Masked Language Modeling), learning bidirectional representations. It excels at understanding tasks: classification, NER, question answering.
GPT (Decoder-Only)
GPT (Generative Pre-trained Transformer) uses only the decoder with masked attention. Pre-trained to predict the next token (Causal Language Modeling), it excels at text generation. GPT-3 and GPT-4 demonstrated emergent capabilities as scale increases.
from transformers import pipeline, AutoTokenizer, AutoModel

# Sentiment analysis with Hugging Face pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was absolutely fantastic!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")
text = generator("Deep learning is", max_length=50, num_return_sequences=1)
print(text[0]['generated_text'])

# BERT embeddings for downstream tasks
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The transformer architecture is revolutionary",
                   return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state: (1, seq_len, 768)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
print(f"CLS embedding: {cls_embedding.shape}")  # [1, 768]
Vision Transformer (ViT) and Beyond
The success of Transformers in NLP inspired their application in other domains. The Vision Transformer (ViT) divides an image into patches (typically 16x16 pixels), treats each patch as a token, and applies the standard Transformer encoder. With modest training data ViT lags behind CNNs, which benefit from built-in inductive biases like locality; but when pre-trained on massive datasets it matches and then surpasses them.
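The patch-splitting step is the only part of ViT that differs from the text Transformer. A common trick, sketched below, is to implement it as a strided convolution (the class name `PatchEmbedding` is ours; real ViT implementations also prepend a class token and add positional embeddings):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to d_model.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    flattening each patch and applying a shared linear layer.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                     # (batch, d_model, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, d_model)

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```

After this step, the 196 patch tokens are processed exactly like word tokens in the text encoder.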
Today Transformers are the foundation of: language models (GPT-4, Claude, Llama), image generation (DALL-E, Stable Diffusion), speech recognition (Whisper), robotics (RT-2), and multimodal models (GPT-4V, Gemini). The architecture has proven to be a universal foundation for modern artificial intelligence.
Next Steps in the Series
- In the next article we will explore GANs (Generative Adversarial Networks)
- We will see how two competing networks generate realistic synthetic data
- We will analyze DCGAN, StyleGAN, and the challenges of adversarial training