Introduction: How CNNs See the World
Convolutional Neural Networks (CNNs) have revolutionized computer vision, enabling automatic image recognition with superhuman accuracy. Unlike fully-connected networks, CNNs exploit the spatial structure of data: nearby pixels tend to be correlated, and local patterns (edges, textures) repeat across different positions in the image.
The key idea behind CNNs is the convolution operation: a filter (kernel) slides across the image extracting local features. Layer after layer, the network builds a hierarchy of increasingly abstract features: from simple edges to complex shapes to complete objects.
What You Will Learn
- The convolution operation: kernels, stride, padding
- Pooling: dimensionality reduction and translation invariance
- Historic architectures: LeNet, AlexNet, VGG, ResNet
- Skip connections and the vanishing gradient problem in deep networks
- Transfer learning: reusing pre-trained models from ImageNet
- Data augmentation to improve model robustness
- Complete PyTorch implementation with training and evaluation
The Convolution Operation
At the heart of a CNN lies the convolutional layer. A small filter (kernel), typically of size 3x3 or 5x5, slides across the image; at each position it computes the sum of element-wise products (a dot product) between the filter weights and the underlying image patch. The result is a feature map that highlights the presence of a specific pattern at each position.
Two parameters control the convolution behavior:
- Stride: the step size at which the kernel moves. Stride=1 shifts the kernel one pixel at a time; stride=2 skips every other position and roughly halves the spatial dimensions
- Padding: adding zeros around the image borders. "Same" padding preserves the input dimensions (with stride=1), while "valid" padding (no padding) shrinks them
For an input of size n, kernel size k, padding p, and stride s, the output size is floor((n + 2p - k) / s) + 1.
```python
import torch
import torch.nn as nn

# 2D convolution: 3 input channels (RGB), 16 output filters, 3x3 kernel
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                       stride=1, padding=1)

# Input: batch of 4 RGB 32x32 images
x = torch.randn(4, 3, 32, 32)
output = conv_layer(x)
print(f"Input shape: {x.shape}")        # [4, 3, 32, 32]
print(f"Output shape: {output.shape}")  # [4, 16, 32, 32]

# With stride=2 and padding=1, spatial dimensions are halved
conv_stride2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
output_s2 = conv_stride2(x)
print(f"Stride 2 output: {output_s2.shape}")  # [4, 16, 16, 16]
```
Pooling: Dimensionality Reduction
After convolution, pooling layers reduce the spatial dimensions of feature maps, decreasing the number of parameters and computational cost. Pooling also introduces a form of translation invariance: small shifts of the object in the image do not change the extracted features.
The two main types are:
- Max Pooling: selects the maximum value in each window. Preserves the most prominent features and is the most widely used type
- Average Pooling: computes the average in each window. Produces smoother features, and in its global form is often used just before the classifier
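The difference between the two is easy to see on a tiny feature map. The sketch below (values chosen purely for illustration) applies a 2x2 max pool and a 2x2 average pool to the same input: max pooling keeps the peak activation in each window, while average pooling smooths it out.

```python
import torch
import torch.nn as nn

# A 1x1x4x4 feature map with one prominent activation (the 9)
x = torch.tensor([[[[1., 2., 0., 0.],
                    [3., 9., 0., 0.],
                    [0., 0., 4., 4.],
                    [0., 0., 4., 4.]]]])

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x))  # [[[[9., 0.], [0., 4.]]]]    - keeps the peak
print(avg_pool(x))  # [[[[3.75, 0.], [0., 4.]]]]  - averages the window
```

Both halve the spatial dimensions (4x4 -> 2x2), but only max pooling preserves the strongest response exactly.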
Why CNNs Work So Well
CNNs exploit three fundamental properties of images: locality (important features are local), weight sharing (the same filter is applied everywhere, drastically reducing parameters), and translation invariance (a cat is a cat whether at the center or in a corner of the image). These properties make CNNs orders of magnitude more efficient than fully-connected networks for data with spatial structure.
Historic Architectures: From LeNet to ResNet
LeNet-5 (1998)
Designed by Yann LeCun for handwritten digit recognition, LeNet-5 was the first successful CNN. With only 5 layers (2 convolutions + 3 fully connected), it demonstrated that convolutional networks could beat traditional feature engineering methods.
VGG (2014)
VGGNet demonstrated that depth matters: using exclusively stacked 3x3 kernels across 16 or 19 layers, it achieved excellent performance on ImageNet. Two sequential 3x3 filters cover the same receptive field as a single 5x5 filter, but with fewer parameters and more non-linearity.
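The parameter savings can be verified directly. Ignoring biases, a 5x5 convolution with C input and C output channels has 25C^2 weights, while two stacked 3x3 convolutions have 2 * 9C^2 = 18C^2. The sketch below checks this with PyTorch; C = 64 is an arbitrary channel count chosen for illustration.

```python
import torch.nn as nn

C = 64  # illustrative channel count

# One 5x5 convolution vs two stacked 3x3 convolutions
# (bias disabled to compare kernel weights only)
conv5 = nn.Conv2d(C, C, kernel_size=5, bias=False)
conv3x2 = nn.Sequential(nn.Conv2d(C, C, 3, bias=False),
                        nn.Conv2d(C, C, 3, bias=False))

p5 = sum(p.numel() for p in conv5.parameters())    # 25 * C^2 = 102,400
p3 = sum(p.numel() for p in conv3x2.parameters())  # 18 * C^2 = 73,728
print(f"5x5: {p5:,}  two 3x3: {p3:,}  savings: {1 - p3/p5:.0%}")
```

The stacked version uses 28% fewer parameters while covering the same 5x5 receptive field, and inserts an extra non-linearity between the two convolutions.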
ResNet (2015)
ResNet (Residual Network) solved the problem of training very deep networks with skip connections: instead of forcing a block to learn a full transformation H(x) directly, each block learns only the residual F(x) = H(x) - x, and its output is F(x) + x. The identity shortcut lets the gradient flow directly through layers, making it possible to train networks with 152+ layers.
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block from ResNet."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x  # Skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity  # Residual: F(x) + x
        return self.relu(out)

class SimpleCNN(nn.Module):
    """CNN for CIFAR-10 classification."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),           # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),           # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))  # 8x8 -> 1x1
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

model = SimpleCNN()
x = torch.randn(8, 3, 32, 32)
print(f"Output: {model(x).shape}")  # [8, 10]
```
Transfer Learning: Reusing Pre-Trained Models
Training a CNN from scratch on large datasets requires enormous computational resources. Transfer learning solves this problem: take a model pre-trained on a massive dataset (typically ImageNet, whose ILSVRC subset contains 1.2 million labeled images) and adapt it to your specific task.
The most common strategy involves two phases:
- Feature extraction: freeze the pre-trained model weights and replace only the final classifier
- Fine-tuning: unfreeze some of the later (deeper) layers and re-train them with a very low learning rate
```python
import torch.nn as nn
import torchvision.models as models

# Load ResNet50 pre-trained on ImageNet
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier for 5 classes
num_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Dropout(0.3),
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 5)  # 5 custom classes
)

# Only the new classifier parameters will be trained
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
```
Data Augmentation: Robustness Through Transformations
Data augmentation is a regularization technique that artificially increases the diversity of the training dataset by applying random transformations to images. Rotations, crops, horizontal flips, and color variations teach the network to be invariant to these transformations.
```python
from torchvision import transforms

# Data augmentation pipeline for training
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.1)
])

# For validation/test: only resize and normalize
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
```
Next Steps in the Series
- In the next article we will explore Recurrent Neural Networks (RNNs) and LSTMs for sequence processing
- We will see how LSTMs solve vanishing gradient and model temporal dependencies
- We will implement a sentiment analysis and text generation model