Introduction: How Models Find the Best Weights
Training a machine learning model means finding the parameter values that minimize a loss function. This process, called optimization, lies at the heart of deep learning. The workhorse algorithm is gradient descent, together with its many variants, each balancing speed, stability, and generalization differently.
What You Will Learn
- Vanilla Gradient Descent and its limitations
- Stochastic Gradient Descent (SGD) and mini-batches
- Momentum and Nesterov Accelerated Gradient
- Adaptive algorithms: RMSprop and Adam
- Learning rate scheduling: warmup, cosine annealing
- Convergence, saddle points, and the loss landscape
Gradient Descent: The Base Algorithm
Gradient descent updates parameters by moving in the direction opposite to the gradient of the loss:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t)

where \eta is the learning rate (a critical hyperparameter) and \nabla_{\theta} L is the gradient computed on the entire training set (batch gradient descent).
Analogy: imagine standing on a hill in the fog, wanting to reach the lowest valley. At each step, you feel the slope under your feet (gradient) and walk in the steepest downhill direction. The learning rate is the step size: too large and you overshoot the valley, too small and it takes forever.
The Learning Rate Problem
The choice of learning rate is crucial:
- \eta too large: the loss oscillates or diverges
- \eta too small: extremely slow convergence
- \eta just right: stable convergence to the minimum
```python
import numpy as np

# Objective function: f(x) = x^4 - 3x^2 + 2 (has two minima)
def f(x):
    return x**4 - 3*x**2 + 2

def grad_f(x):
    return 4*x**3 - 6*x

# Gradient descent with different learning rates
for lr in [0.01, 0.05, 0.1]:
    x = 2.0  # starting point
    history = [x]
    for _ in range(100):
        x = x - lr * grad_f(x)
        history.append(x)
    print(f"lr={lr}: x_final={x:.6f}, f(x)={f(x):.6f}")
```
Stochastic Gradient Descent (SGD)
Computing the gradient on the entire dataset is expensive. SGD uses a single sample (or a mini-batch) to estimate the gradient:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t; x^{(i)}, y^{(i)})
The estimated gradient is noisy but on average points in the right direction. The noise has a surprising benefit: it helps escape local minima and acts as an implicit regularizer.
In practice, mini-batch SGD is used with batches of 32-256 samples, a compromise between gradient variance (lower with larger batches) and computational efficiency.
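To make the mini-batch estimate concrete, here is a minimal sketch on ordinary least-squares linear regression. The data, batch size, and learning rate are illustrative choices, not from the text; each update uses the gradient of the loss on one mini-batch only:

```python
import numpy as np

# Illustrative mini-batch SGD on least-squares linear regression
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 64
for epoch in range(20):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # Gradient of 0.5 * mean squared error, on this mini-batch only
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

print(w)  # close to true_w despite every step using a noisy estimate
```

Each individual gradient is noisy, but over many updates the estimates average out and w converges to the true coefficients.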
Momentum: Accelerating Convergence
Momentum adds a "velocity" term to the update, accumulating past gradients as an exponential moving average:

v_t = \beta v_{t-1} + (1 - \beta) \nabla_{\theta} L(\theta_t)
\theta_{t+1} = \theta_t - \eta v_t

where \beta \approx 0.9 controls how much "memory" the momentum has.
Intuition: think of a ball rolling down a hill. Without momentum, it stops every time it hits a small bump. With momentum, the ball accumulates speed and rolls over small dips, converging faster to the valley. Momentum reduces oscillations in directions with alternating gradients and accelerates in directions with consistent gradient.
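A minimal sketch of these mechanics, using the exponential-moving-average form of momentum on the elliptic quadratic f(x, y) = x^2 + 10y^2 that also appears later in this article (the lr and beta values are illustrative, not from the text):

```python
import numpy as np

# f(x, y) = x**2 + 10*y**2: steep in y, shallow in x
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

p_plain = np.array([5.0, 5.0])   # plain gradient descent
p_mom = np.array([5.0, 5.0])     # gradient descent with momentum
v = np.zeros(2)
lr, beta = 0.05, 0.9
for _ in range(100):
    p_plain -= lr * grad(p_plain)
    v = beta * v + (1 - beta) * grad(p_mom)  # velocity: EMA of gradients
    p_mom -= lr * v
print(p_plain, p_mom)  # both near the minimum (0, 0)
```

On a well-conditioned toy like this both variants converge; momentum's damping of oscillations pays off most when gradients alternate sign along steep directions, as in noisy high-dimensional losses.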
Nesterov Accelerated Gradient (NAG)
NAG improves momentum by computing the gradient at the "future" (look-ahead) position:

v_t = \beta v_{t-1} + (1 - \beta) \nabla_{\theta} L(\theta_t - \eta \beta v_{t-1})
\theta_{t+1} = \theta_t - \eta v_t
"First look where I am going, then correct the direction." NAG converges faster than standard momentum and decelerates before minima.
RMSprop: Adaptive Learning Rate
RMSprop adapts the learning rate individually for each parameter, dividing by the square root of a running average of squared gradients:

s_t = \rho s_{t-1} + (1 - \rho) g_t^2
\theta_{t+1} = \theta_t - \eta \, g_t / (\sqrt{s_t} + \epsilon)

where g_t = \nabla_{\theta} L(\theta_t) and \rho \approx 0.9.
For parameters with large gradients, the effective learning rate decreases; for parameters with small gradients, it increases. This solves the problem of different scales across features.
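A minimal sketch of this per-parameter scaling on a badly conditioned quadratic (the rho, lr, and epsilon values are illustrative):

```python
import numpy as np

# f(x, y) = x**2 + 100*y**2: the two parameters have gradient scales
# that differ by a factor of 100
def grad(p):
    return np.array([2 * p[0], 200 * p[1]])

p = np.array([5.0, 5.0])
s = np.zeros(2)                       # running average of squared gradients
lr, rho, eps = 0.1, 0.9, 1e-8
for _ in range(300):
    g = grad(p)
    s = rho * s + (1 - rho) * g**2    # per-parameter second-moment estimate
    p -= lr * g / (np.sqrt(s) + eps)  # each coordinate gets its own step size
print(p)  # both coordinates near 0 despite the 100x scale difference
```

Without the division by \sqrt{s_t}, a single learning rate small enough for the steep y-direction would crawl along the shallow x-direction.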
Adam: The State-of-the-Art
Adam (Adaptive Moment Estimation) combines the best of momentum and RMSprop, maintaining both a running average of the gradient (first moment) and a running average of squared gradients (second moment):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

With bias correction to compensate for the zero initialization:

\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)

Final update:

\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
Recommended default hyperparameters: \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}.
```python
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

# Test: minimize f(x, y) = x^2 + 10*y^2 (elliptic landscape)
def f(params):
    return params[0]**2 + 10 * params[1]**2

def grad_f(params):
    return np.array([2*params[0], 20*params[1]])

# Compare SGD vs Adam
params_sgd = np.array([5.0, 5.0])
params_adam = np.array([5.0, 5.0])
optimizer = Adam(lr=0.1)
print("Step | SGD f(x) | Adam f(x)")
for step in range(50):
    params_sgd -= 0.01 * grad_f(params_sgd)
    params_adam = optimizer.update(params_adam, grad_f(params_adam))
    if step % 10 == 0:
        print(f"{step:4d} | {f(params_sgd):8.4f} | {f(params_adam):8.4f}")
```
Learning Rate Scheduling
Keeping the learning rate fixed throughout training is rarely optimal. Scheduling strategies adapt \eta as training progresses:
Step Decay
\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}

where \gamma (typically 0.1) is the decay factor and s is the number of epochs between reductions.
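A short sketch of this schedule (eta0, gamma, and s are illustrative values):

```python
# Step decay: eta_t = eta0 * gamma ** (t // s)
def step_decay(t, eta0=0.1, gamma=0.1, s=30):
    return eta0 * gamma ** (t // s)

for epoch in [0, 29, 30, 60, 90]:
    print(epoch, step_decay(epoch))
# 0.1 for epochs 0-29, then 0.01, 0.001, 0.0001 at each 30-epoch boundary
```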
Cosine Annealing
Gradually reduces the learning rate along a cosine curve:

\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos(\pi t / T)\right)

The decay is slow at first, fastest mid-training, and flattens out near the end.
Warmup + Decay
Used in Transformers: the learning rate increases linearly for T_w steps (warmup), then decays, e.g.

\eta_t = \eta_{max} \cdot \min\left(t / T_w, \sqrt{T_w / t}\right)
```python
import numpy as np

def cosine_annealing(t, T, eta_min=1e-6, eta_max=1e-3):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(t * np.pi / T))

def warmup_cosine(t, warmup_steps, total_steps, eta_max=1e-3):
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    else:
        progress = (t - warmup_steps) / (total_steps - warmup_steps)
        return eta_max * 0.5 * (1 + np.cos(progress * np.pi))

# Print the schedule values
total_steps = 1000
warmup = 100
print("Step | Cosine LR | Warmup+Cosine LR")
for t in range(0, total_steps, 100):
    cos_lr = cosine_annealing(t, total_steps)
    warm_lr = warmup_cosine(t, warmup, total_steps)
    print(f"{t:4d} | {cos_lr:.6f} | {warm_lr:.6f}")
```
Saddle Points and Loss Landscape
In high-dimensional spaces (millions of parameters), bad local minima are rare. The real obstacle is saddle points: points where the gradient is zero but which are neither minima nor maxima. The number of saddle points relative to local minima grows exponentially with dimensionality.
Fortunately, SGD with momentum and Adam can escape saddle points thanks to stochastic noise and accumulated momentum.
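The escape mechanism can be seen on the classic toy saddle f(x, y) = x^2 - y^2, which has a saddle point at the origin (the noise scale and step count are illustrative; this toy function is unbounded below, so "escaping" here simply means leaving the saddle):

```python
import numpy as np

# f(x, y) = x**2 - y**2: minimum in x, maximum in y, saddle at (0, 0)
def grad(p):
    return np.array([2 * p[0], -2 * p[1]])

rng = np.random.default_rng(0)
lr = 0.1
p_det = np.array([1.0, 0.0])    # started exactly on the x-axis
p_noisy = np.array([1.0, 0.0])  # same start, with SGD-like gradient noise
for _ in range(60):
    p_det -= lr * grad(p_det)
    p_noisy -= lr * (grad(p_noisy) + 0.01 * rng.normal(size=2))
print(p_det)    # stays at the saddle: y remains exactly 0
print(p_noisy)  # y has drifted away from 0: the noise broke the symmetry
```

Deterministic gradient descent started on the x-axis can never develop a y-component, so it converges to the saddle; the tiny injected noise seeds the unstable direction, which the negative curvature then amplifies.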
Summary and Connections to ML
Key Takeaways
- Gradient Descent: \theta \leftarrow \theta - \eta \nabla L - the base algorithm
- SGD: uses mini-batches for efficiency, noise helps generalization
- Momentum: accumulates velocity, overcomes irregularities in the loss surface
- Adam: combines momentum + adaptive learning rate, the DL default
- Learning rate scheduling: warmup + cosine decay is the Transformer standard
- Saddle points: more problematic than local minima in high dimensions
In the Next Article: we will explore information theory. We will cover entropy, cross-entropy (the most used classification loss), KL divergence, and the deep connections with maximum likelihood.