Introduction: When Data Is Not Enough
In machine learning, data is everything. But often we do not have enough, or the dataset is imbalanced (one class has far more samples than others). Data augmentation is a set of techniques for artificially expanding the dataset, creating new samples from existing ones. The mathematics behind these techniques ranges from geometric transformations to statistical interpolation.
What You Will Learn
- Geometric transformations for images: rotation, flip, scaling
- Mixup and CutMix: interpolation between samples
- SMOTE: synthetic oversampling for minority classes
- Text and time series augmentation
- Generative models: GANs and VAEs for synthetic data
- When augmentation helps and when it hurts
Geometric Transformations for Images
Geometric transformations are the simplest and most intuitive forms of augmentation. Each transformation can be expressed as a transformation matrix applied to pixel coordinates.
Rotation
A rotation by angle \theta in the 2D plane:

R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

Scaling (Zoom)
A scaling by factors s_x, s_y along the two axes (with s_x = s_y > 1 zooming in, < 1 zooming out):

S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}

General Affine Transformation
Combining rotation, scaling, translation, and shear in homogeneous coordinates:

A = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix}
```python
import numpy as np

def rotate_image(image, angle_deg):
    """Image rotation via inverse mapping (simplified for 2D arrays)."""
    angle_rad = np.radians(angle_deg)
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    # Rotation matrix
    R = np.array([[cos_a, -sin_a],
                  [sin_a,  cos_a]])
    h, w = image.shape[:2]
    center = np.array([h / 2, w / 2])
    # Inverse mapping: for each output pixel, look up the source pixel
    rotated = np.zeros_like(image)
    for i in range(h):
        for j in range(w):
            coords = np.array([i, j]) - center
            src = R.T @ coords + center  # R.T is the inverse rotation
            si, sj = int(round(src[0])), int(round(src[1]))
            if 0 <= si < h and 0 <= sj < w:
                rotated[i, j] = image[si, sj]
    return rotated

def augment_batch(images, labels):
    """Apply random augmentations to a batch, keeping the originals."""
    augmented_images = []
    augmented_labels = []
    for img, label in zip(images, labels):
        augmented_images.append(img)
        augmented_labels.append(label)
        # Horizontal flip (50% probability)
        if np.random.random() > 0.5:
            augmented_images.append(np.fliplr(img))
            augmented_labels.append(label)
        # Vertical flip (30% probability)
        if np.random.random() > 0.7:
            augmented_images.append(np.flipud(img))
            augmented_labels.append(label)
        # Gaussian noise (50% probability)
        if np.random.random() > 0.5:
            noise = np.random.normal(0, 0.05, img.shape)
            augmented_images.append(np.clip(img + noise, 0, 1))
            augmented_labels.append(label)
    return np.array(augmented_images), np.array(augmented_labels)

# Example
np.random.seed(42)
batch = np.random.rand(4, 8, 8)  # 4 images of 8x8
labels = np.array([0, 1, 0, 2])
aug_images, aug_labels = augment_batch(batch, labels)
print(f"Original: {batch.shape[0]} images")
print(f"Augmented: {aug_images.shape[0]} images")
```
Mixup: Interpolation Between Samples
Mixup creates new samples by linearly interpolating between pairs of existing samples (both inputs and labels):

\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j

where \lambda \sim \text{Beta}(\alpha, \alpha) with \alpha \in (0, \infty). Typically \alpha = 0.2 (light mixing).
Why it works: Mixup acts as a regularizer, forcing the model to make linear predictions between samples. It reduces overfitting and improves calibration.
```python
import numpy as np

def mixup(X, y, alpha=0.2):
    """Mixup data augmentation (y can be one-hot or scalar targets)."""
    n = X.shape[0]
    # One lambda per sample, from a Beta distribution
    lam = np.random.beta(alpha, alpha, size=n)
    # Random permutation assigns each sample a mixing partner
    indices = np.random.permutation(n)
    # Reshape lambda to broadcast over the trailing dimensions
    lam_x = lam.reshape(-1, *([1] * (X.ndim - 1)))
    lam_y = lam.reshape(-1, *([1] * (y.ndim - 1)))
    # Interpolate inputs and labels with the same lambda
    X_mix = lam_x * X + (1 - lam_x) * X[indices]
    y_mix = lam_y * y + (1 - lam_y) * y[indices]
    return X_mix, y_mix

# Example with one-hot labels
np.random.seed(42)
X = np.random.randn(100, 10)                 # 100 samples, 10 features
y = np.eye(3)[np.random.randint(0, 3, 100)]  # One-hot, 3 classes
X_mix, y_mix = mixup(X, y, alpha=0.2)
print(f"Original sample y[0]: {y[0]}")
print(f"Mixup sample y_mix[0]: {np.round(y_mix[0], 3)}")
print(f"Mixup label sum: {y_mix[0].sum():.4f}")  # Always 1: convex combination
```
CutMix: Cut and Paste
CutMix cuts a rectangular region from one image and replaces it with the corresponding region from another. The mixed label weights the original label by the surviving area fraction:

\lambda = 1 - \frac{r_w r_h}{W H}, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j

where r_w, r_h are the width and height of the cut region, and W, H are the image dimensions.
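The area-based label weighting can be sketched in NumPy. This is a minimal sketch, assuming single-channel images of shape (n, H, W) and one-hot labels; the `cutmix` function and its parameters are illustrative, not a library API:

```python
import numpy as np

def cutmix(X, y, alpha=1.0, rng=np.random.default_rng(0)):
    """CutMix sketch: paste a random patch from a shuffled batch partner.

    X: (n, H, W) images, y: (n, C) one-hot labels.
    """
    n, H, W = X.shape
    indices = rng.permutation(n)          # mixing partners
    lam = rng.beta(alpha, alpha)
    # Patch dimensions so that the cut area fraction is about (1 - lam)
    r_h, r_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, H), rng.integers(0, W)
    y1, y2 = np.clip(cy - r_h // 2, 0, H), np.clip(cy + r_h // 2, 0, H)
    x1, x2 = np.clip(cx - r_w // 2, 0, W), np.clip(cx + r_w // 2, 0, W)
    X_mix = X.copy()
    X_mix[:, y1:y2, x1:x2] = X[indices, y1:y2, x1:x2]
    # Recompute lambda from the area actually pasted (clipping may shrink it)
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)
    y_mix = lam_adj * y + (1 - lam_adj) * y[indices]
    return X_mix, y_mix

X = np.random.rand(4, 8, 8)
y = np.eye(2)[[0, 1, 0, 1]]
X_mix, y_mix = cutmix(X, y)
print(X_mix.shape, np.round(y_mix[0], 3))
```

As with Mixup, each mixed label remains a convex combination of two one-hot vectors, so it still sums to 1.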
SMOTE: Oversampling for Minority Classes
SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority class by interpolating between existing samples and their k nearest neighbors:

\mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda (\mathbf{x}_{nn} - \mathbf{x}_i), \qquad \lambda \sim \mathcal{U}(0, 1)

where \mathbf{x}_{nn} is a random neighbor among the k nearest neighbors of \mathbf{x}_i.
When to use SMOTE: for tabular datasets with class imbalance (fraud, medical diagnoses, anomalies). Do NOT use for images (prefer Focal Loss or class weights) and do NOT apply to the test set (training only).
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_synthetic, k=5):
    """SMOTE: generate synthetic samples from the minority class."""
    n_samples = X_minority.shape[0]
    k_actual = min(k, n_samples - 1)
    # Fit k+1 neighbors: each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k_actual + 1)
    nn.fit(X_minority)
    _, indices = nn.kneighbors(X_minority)
    synthetic = []
    for _ in range(n_synthetic):
        # Choose a random minority sample
        idx = np.random.randint(0, n_samples)
        # Choose a random neighbor (column 0 is the sample itself)
        nn_idx = indices[idx, np.random.randint(1, k_actual + 1)]
        # Interpolate between the sample and its neighbor
        lam = np.random.random()
        new_sample = X_minority[idx] + lam * (X_minority[nn_idx] - X_minority[idx])
        synthetic.append(new_sample)
    return np.array(synthetic)

# Imbalanced dataset: 100 samples of class 0, 10 of class 1
np.random.seed(42)
X_majority = np.random.randn(100, 5) + np.array([2, 0, 0, 0, 0])
X_minority = np.random.randn(10, 5) + np.array([-2, 0, 0, 0, 0])
# Generate 90 synthetic samples to balance the classes
X_synthetic = smote(X_minority, n_synthetic=90, k=5)
X_balanced = np.vstack([X_majority, X_minority, X_synthetic])
y_balanced = np.array([0] * 100 + [1] * 10 + [1] * 90)
print(f"Original: class 0={100}, class 1={10}")
print(f"Balanced: class 0={np.sum(y_balanced == 0)}, class 1={np.sum(y_balanced == 1)}")
print(f"Balanced shape: {X_balanced.shape}")
```
Text Augmentation
For text data, the main techniques are:
- Synonym Replacement: replace words with synonyms
- Random Insertion: insert synonyms at random positions
- Random Swap: swap word positions
- Random Deletion: delete random words with probability p
- Back-Translation: translate to another language and back (EN -> FR -> EN)
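Random Deletion and Random Swap are easy to sketch in plain Python; Synonym Replacement and Back-Translation need external resources (a thesaurus, a translation model) and are omitted here. A minimal sketch, with illustrative function names and parameters:

```python
import random

random.seed(42)

def random_deletion(words, p=0.1):
    """Drop each word independently with probability p (keep at least one)."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n_swaps=1):
    """Swap two random word positions, n_swaps times."""
    words = list(words)
    for _ in range(n_swaps):
        i, j = random.randrange(len(words)), random.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

sentence = "data augmentation artificially expands small datasets".split()
print(random_deletion(sentence, p=0.2))
print(random_swap(sentence, n_swaps=2))
```

Both operations preserve most of the sentence's meaning while changing its surface form, which is exactly the property a label-preserving augmentation needs.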
Time Series Augmentation
Specific techniques for temporal data:
Jittering (Gaussian Noise)
Adds independent Gaussian noise to every time step:

\tilde{x}_t = x_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)
Time Warping
Distorts the time axis with a random monotone function, speeding up or slowing down portions of the series.
Window Slicing
Extracts random sub-sequences of length w < T and rescales them to the original length.
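Jittering and Window Slicing fit in a few lines of NumPy; `np.interp` handles the rescaling back to the original length. The series and parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 100
series = np.sin(np.linspace(0, 4 * np.pi, T))

# Jittering: add Gaussian noise to every time step
jittered = series + rng.normal(0, 0.05, T)

# Window slicing: random sub-sequence of length w, rescaled to length T
w = 70
start = rng.integers(0, T - w)
window = series[start:start + w]
sliced = np.interp(np.linspace(0, w - 1, T), np.arange(w), window)

print(jittered.shape, sliced.shape)  # both (100,)
```

Note that slicing changes the effective frequency content of the series, so it should only be used when the label does not depend on absolute time scale.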
Generative Models for Synthetic Data
GAN (Generative Adversarial Networks)
A generator G and a discriminator D compete in a min-max game:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

G generates fake data from noise z, while D tries to distinguish real from fake. At equilibrium, G produces samples indistinguishable from real data and D outputs 1/2 everywhere.
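To make the value function concrete, we can evaluate it on hypothetical discriminator outputs. At the equilibrium D outputs 1/2 on every sample, giving V = \log(1/2) + \log(1/2) = -\log 4:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], averaged over a batch."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# A confident discriminator keeps the value high...
print(gan_value(np.array([0.9, 0.8]), np.array([0.1, 0.2])))
# ...while a fully fooled one (outputs 0.5 everywhere) gives -log 4
print(round(gan_value(np.full(2, 0.5), np.full(2, 0.5)), 4))  # -1.3863
```

Training alternates gradient steps: D ascends this value while G descends it (in practice via surrogate losses for numerical stability).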
VAE (Variational Autoencoder)
The VAE loss combines reconstruction and latent space regularization:

\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))

The first term measures reconstruction quality; the second pushes the latent posterior toward the standard Gaussian prior p(z) = \mathcal{N}(0, I), which makes it possible to sample new data by decoding z \sim \mathcal{N}(0, I).
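When the encoder outputs a diagonal Gaussian q(z|x) = \mathcal{N}(\mu, \text{diag}(e^{\text{logvar}})), the KL term has a closed form. A small NumPy check, where mu and logvar stand in for hypothetical encoder outputs:

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)

mu = np.array([[0.0, 0.0], [1.0, -1.0]])  # two latent codes, 2 dims each
logvar = np.zeros((2, 2))                 # unit variance
print(kl_diag_gaussian(mu, logvar))       # [0. 1.]
```

The first code matches the prior exactly (KL = 0); the second pays a penalty for its shifted mean, which is what keeps the latent space sampleable.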
When Augmentation Works (and When It Does Not)
Works well:
- Small datasets with few samples per class
- Transformations that preserve semantics (flipping natural images)
- Imbalanced classes (SMOTE for tabular, Focal Loss + augmentation for images)
Does not work or causes harm:
- Transformations that change semantics (rotating digits 6 and 9)
- Overly aggressive augmentation (deforms data beyond recognition)
- Applying augmentation to the test set (data leakage)
- Already abundant and diversified data
Summary
Key Takeaways
- Geometric transformations: rotation, scaling, flip matrices - the basis of image augmentation
- Mixup: \tilde{x} = \lambda x_i + (1-\lambda) x_j - regularization through interpolation
- SMOTE: interpolates between minority class neighbors to balance
- GAN: min-max game to generate realistic synthetic data
- VAE: reconstruction + KL divergence for a sampleable latent space
- Golden rule: augmentation must preserve semantics and never touch the test set
Series Conclusion: with this article, the "Mathematics and Statistics for AI" series concludes. We have covered the foundations from linear algebra to information theory, from optimization to Transformer mathematics. Every concept has been connected to practical ML/AI applications with NumPy implementations. These mathematical foundations will allow you to deeply understand any machine learning algorithm you encounter.