Introduction: The Transformer Revolution
Transformers, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., have revolutionized AI. From GPT to BERT, from DALL-E to Stable Diffusion, from Claude to Gemini: all are built on the Transformer architecture. The secret of their success is the self-attention mechanism, which captures long-range dependencies without the limitations of recurrent networks.
What You Will Learn
- Query, Key, Value: the three linear projections
- Scaled Dot-Product Attention: the central formula
- Why divide by the square root of d_k
- Multi-Head Attention: multiple perspectives
- Positional Encoding: adding order without recurrence
- Step-by-step NumPy implementation
Q, K, V Projections: Three Perspectives on Data
The attention mechanism operates on three linear transformations of the input \mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}:

\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V

where \mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d_{\text{model}} \times d_k} and \mathbf{W}^V \in \mathbb{R}^{d_{\text{model}} \times d_v}.
Intuition:
- Query (\mathbf{Q}): "what am I looking for?" - the question each token asks
- Key (\mathbf{K}): "what do I offer?" - each token's label
- Value (\mathbf{V}): "what is my content?" - the actual information
The attention mechanism computes similarity between each Query and all Keys, uses these similarities as weights, and combines the corresponding Values.
Scaled Dot-Product Attention
The central formula of the Transformer:

\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}
Step by step:
- \mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}: similarity matrix between all tokens (dot products)
- Division by \sqrt{d_k}: scaling for numerical stability
- Row-wise softmax: converts scores into weights (probabilities) summing to 1
- Multiplication by \mathbf{V}: weighted average of the Values
Why the Scaling Factor?
Without \sqrt{d_k}, the dot product grows with the vector dimension. If q and k have i.i.d. components with mean 0 and variance 1, then:

q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \quad \mathbb{E}[q \cdot k] = 0, \quad \text{Var}(q \cdot k) = d_k

For large d_k, the dot product therefore takes very large or very small values, pushing softmax into saturation regions (near-zero gradients). Dividing by \sqrt{d_k} brings the variance back to 1.
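The variance argument above is easy to verify empirically. The following sketch (sample sizes and dimensions chosen arbitrarily for the demo) estimates the variance of q · k before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check: Var(q . k) grows like d_k for i.i.d. standard-normal
# components, while dividing by sqrt(d_k) keeps the variance near 1.
for d_k in (4, 64, 512):
    q = rng.standard_normal((20_000, d_k))
    k = rng.standard_normal((20_000, d_k))
    dots = np.einsum("ij,ij->i", q, k)  # one dot product per row
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 3))
```

The unscaled variance tracks d_k almost exactly, which is why the raw logits saturate softmax as dimensions grow.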
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled Dot-Product Attention."""
    d_k = Q.shape[-1]
    # 1. Compute similarity scores
    scores = Q @ K.T
    scores = scores / np.sqrt(d_k)
    # 2. Apply mask (optional, for the decoder)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    # 3. Softmax to get attention weights
    attention_weights = softmax(scores)
    # 4. Weighted average of the Values
    output = attention_weights @ V
    return output, attention_weights
# Example: sequence of 4 tokens, dimension 8
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 8
X = np.random.randn(seq_len, d_model)
# Projection matrices
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1
# Projections
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
# Attention
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Input shape: {X.shape}")
print(f"Attention weights:\n{np.round(weights, 3)}")
print(f"Output shape: {output.shape}")
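The `mask` argument deserves a quick illustration: in a decoder, a causal mask prevents each token from attending to future positions. A minimal self-contained sketch (the softmax helper is repeated so the snippet runs on its own; sizes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
n, d_k = 4, 8
Q, K, V = (np.random.randn(n, d_k) for _ in range(3))

# Causal mask: token i may attend only to positions j <= i
mask = np.tril(np.ones((n, n)))

scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(mask == 0, -1e9, scores)  # block future positions
weights = softmax(scores)

# The upper triangle of the weight matrix is zero: no information
# flows backward from future tokens.
print(np.round(weights, 3))
```

Setting masked scores to a large negative number (rather than exactly -inf) is a common trick: after softmax those entries underflow to zero while avoiding NaN issues.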
Multi-Head Attention: Multiple Perspectives
Instead of a single attention mechanism, Transformers use h parallel heads, each with its own projection matrices:

\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V), \quad \text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}^O

Each head has dimension d_k = d_{\text{model}} / h. With 8 heads and d_{\text{model}} = 512, each head operates in a 64-dimensional space.
Why multiple heads? Each head can capture a different type of relationship: one might focus on syntax, another on semantics, another on co-reference.
import numpy as np

def multi_head_attention(X, n_heads, d_model):
    """Simplified Multi-Head Attention implementation."""
    d_k = d_model // n_heads
    heads_output = []
    all_weights = []
    for h in range(n_heads):
        # Per-head projection matrices (randomly initialized for the demo)
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1
        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V
        # Scaled dot-product attention
        scores = Q @ K.T / np.sqrt(d_k)
        weights = softmax(scores)
        head_out = weights @ V
        heads_output.append(head_out)
        all_weights.append(weights)
    # Concatenate heads
    concat = np.concatenate(heads_output, axis=-1)  # (seq_len, d_model)
    # Output projection
    W_O = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_O
    return output, all_weights
np.random.seed(42)
seq_len, d_model, n_heads = 6, 16, 4
X = np.random.randn(seq_len, d_model)
output, weights = multi_head_attention(X, n_heads, d_model)
print(f"Input: {X.shape}, Output: {output.shape}")
print(f"Number of heads: {n_heads}, d_k per head: {d_model // n_heads}")
Positional Encoding: Order Without Recurrence
Attention is permutation-invariant: it does not distinguish token order. To add positional information, a positional encoding based on sinusoidal functions is used:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
Why sine and cosine? Because PE_{pos+k} can be expressed as a linear transformation of PE_{pos}, allowing the model to easily learn to "look" at fixed relative distances.
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal Positional Encoding."""
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)  # Even indices: sine
    PE[:, 1::2] = np.cos(position * div_term)  # Odd indices: cosine
    return PE
# Generate positional encoding
max_len, d_model = 100, 64
PE = positional_encoding(max_len, d_model)
print(f"PE shape: {PE.shape}")
print(f"PE[0, :8]: {np.round(PE[0, :8], 4)}")
print(f"PE[1, :8]: {np.round(PE[1, :8], 4)}")
# Final embedding is: token_embedding + positional_encoding
token_embedding = np.random.randn(10, d_model) # 10 tokens
input_with_position = token_embedding + PE[:10]
print(f"\nEmbedding + PE shape: {input_with_position.shape}")
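The relative-position property mentioned earlier can also be checked numerically: for each (sin, cos) frequency pair, PE at position pos + k is a fixed 2×2 rotation of PE at position pos, independent of pos. A self-contained sketch (the encoding function is repeated so the snippet runs standalone; the offset k = 5 is arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)
    PE[:, 1::2] = np.cos(position * div_term)
    return PE

d_model, k = 8, 5
PE = positional_encoding(100, d_model)
freqs = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

# For each frequency w: [sin((p+k)w), cos((p+k)w)] = R_k @ [sin(pw), cos(pw)],
# where the rotation R_k depends only on the offset k, not on the position p.
for i, w in enumerate(freqs):
    R_k = np.array([[np.cos(k * w), np.sin(k * w)],
                    [-np.sin(k * w), np.cos(k * w)]])
    for p in (0, 10, 40):
        pair_p = PE[p, 2 * i:2 * i + 2]
        pair_pk = PE[p + k, 2 * i:2 * i + 2]
        assert np.allclose(R_k @ pair_p, pair_pk)
print("PE(pos + k) is a fixed linear map of PE(pos) for every pos")
```

This is exactly the angle-addition identity for sine and cosine, which is why a model can attend at fixed relative offsets with a single learned linear map.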
Layer Normalization and Residual Connections
Each Transformer sub-layer uses residual connections and layer normalization:

\mathbf{y} = \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))

Layer normalization normalizes along the feature dimension:

\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta

where \mu and \sigma are the mean and standard deviation computed per sample across features, and \gamma, \beta are learnable parameters.
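A minimal sketch of one residual + LayerNorm step in NumPy (the random linear map standing in for the sub-layer, and all names here, are illustrative, not part of any library API):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize along the feature dimension (last axis), per sample
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

np.random.seed(0)
seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

sublayer_out = x @ np.random.randn(d_model, d_model) * 0.1  # stand-in sub-layer
y = layer_norm(x + sublayer_out, gamma, beta)  # residual + LayerNorm

print(np.round(y.mean(axis=-1), 6))  # ~0 per token
print(np.round(y.std(axis=-1), 6))   # ~1 per token
```

After normalization, every token's feature vector has roughly zero mean and unit variance, which keeps activations in a well-behaved range as layers stack.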
Summary
Key Takeaways
- Attention: \text{softmax}(QK^T / \sqrt{d_k})\,V - weighted similarity between tokens
- Scaling by \sqrt{d_k}: prevents softmax saturation
- Multi-head: h parallel heads capture different relationships
- Positional Encoding: sine/cosine add sequential order
- Residual + LayerNorm: stabilize training of deep networks
- The entire architecture is composed of linear algebra operations and softmax
In the Next Article: we will explore data augmentation and synthetic data generation techniques. SMOTE, Mixup, augmentation for images and text, and when augmentation actually helps.