Introduction: How Neural Networks Learn
If linear algebra is the language of machine learning, differential calculus is its learning engine. Every time a model improves its predictions, it does so through a process called gradient descent, which relies entirely on derivatives and gradients. Without calculus, neural networks could not learn.
In this article, we will see how partial derivatives tell us which direction to adjust weights, how the chain rule makes backpropagation possible, and how everything is implemented in practice with NumPy.
What You Will Learn
- Derivatives: the concept of rate of change
- Partial derivatives and the gradient vector
- Chain rule: how to compose derivatives (the heart of backpropagation)
- Computational graphs: forward and backward pass
- Jacobian and Hessian: higher-order information
- Manual backpropagation implementation in NumPy
Derivatives: The Rate of Change
The derivative of a function f(x) at a point tells us how quickly the function value changes when x changes by an infinitesimal amount:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Intuition: the derivative is the slope of the function at a point. If positive, the function is increasing; if negative, decreasing; if zero, we are at a stationary point (minimum, maximum, or saddle point).
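This slope intuition is easy to verify numerically. A minimal sketch (the test function $f(x) = x^2$ and the step size are illustrative choices): approximating the derivative with a small symmetric difference recovers the expected slopes, including the zero slope at the stationary point.

```python
# Approximate the slope of f(x) = x^2 at a few points with a
# symmetric finite difference: (f(x+h) - f(x-h)) / (2h).
def f(x):
    return x ** 2

def slope(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(f, 2.0))   # close to 4: the function is increasing here
print(slope(f, -3.0))  # close to -6: the function is decreasing here
print(slope(f, 0.0))   # close to 0: stationary point (here, the minimum)
```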
Derivatives of common activation functions in deep learning:

- Sigmoid: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
- Tanh: $\tanh'(x) = 1 - \tanh^2(x)$
- ReLU: $\text{ReLU}'(x) = 1$ for $x > 0$, $0$ for $x < 0$
Why This Matters: the sigmoid derivative has a maximum of 0.25 (when x = 0). This means that at each layer the gradient is multiplied by a factor of at most 0.25, causing the famous vanishing gradient problem in deep networks. That is why ReLU (derivative = 1 for x > 0) is preferred.
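The shrinkage is easy to see numerically. A minimal sketch (the layer counts are illustrative): even in the best case, multiplying the maximum sigmoid derivative across layers collapses the gradient, while a chain of ReLU factors of 1 leaves it unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_deriv(0.0))  # 0.25, the maximum of the sigmoid derivative

# Best case: the gradient factor after passing through n sigmoid layers
for n in [5, 10, 20]:
    print(n, 0.25 ** n)
# After 10 layers the factor is already below 1e-6;
# with ReLU (derivative 1 for x > 0) it would stay 1.
```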
Partial Derivatives and the Gradient
When the function depends on multiple variables (like a loss function depending on all weights), we compute partial derivatives: the derivative with respect to each variable, keeping the others fixed.
For a function $f(x_1, x_2, \ldots, x_n)$, the gradient is the vector of all partial derivatives:

$$\nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]$$
Crucial intuition: the gradient points in the direction of steepest ascent of the function. To minimize the loss, we move in the opposite direction:

$$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$$

where $\eta$ is the learning rate and $L(\theta)$ the loss function. This is the fundamental formula of gradient descent.
```python
import numpy as np

# Example: f(x, y) = x^2 + x*y + y^2 (convex: its only minimum is at the origin)
# Gradient: [2x + y, x + 2y]
def f(x, y):
    return x**2 + x*y + y**2

def gradient_f(x, y):
    df_dx = 2*x + y
    df_dy = x + 2*y
    return np.array([df_dx, df_dy])

# Starting point
x, y = 3.0, 2.0
print(f"f({x}, {y}) = {f(x, y)}")
print(f"Gradient: {gradient_f(x, y)}")

# Gradient descent: step against the gradient
lr = 0.1
for step in range(20):
    grad = gradient_f(x, y)
    x -= lr * grad[0]
    y -= lr * grad[1]
    if step % 5 == 0:
        print(f"Step {step}: x={x:.4f}, y={y:.4f}, f={f(x, y):.6f}")
```
The Chain Rule: The Heart of Backpropagation
The chain rule is the mathematical principle that makes training deep neural networks possible. If we have composed functions $y = f(g(x))$, the derivative is:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

With multiple composed functions $y = f_1(f_2(f_3(x)))$:

$$\frac{dy}{dx} = \frac{df_1}{df_2} \cdot \frac{df_2}{df_3} \cdot \frac{df_3}{dx}$$
A neural network is exactly a composition of functions: each layer applies a linear transformation followed by a non-linear activation. The chain rule allows us to compute how the loss changes with respect to every weight, traversing all layers in reverse order.
Example: Backpropagation on a Single Neuron
Consider a single neuron with MSE loss:

$$\hat{y} = \sigma(wx + b), \qquad L = (y - \hat{y})^2$$

The gradient with respect to $w$ via the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = 2(\hat{y} - y) \cdot \sigma'(z) \cdot x$$

where $z = wx + b$. Each term in the chain has a precise meaning: the error, the activation sensitivity, and the input.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# Single neuron: forward and backward pass
x = 2.0   # input
y = 1.0   # target
w = 0.5   # weight
b = 0.1   # bias
lr = 0.1

for epoch in range(50):
    # Forward pass
    z = w * x + b
    y_hat = sigmoid(z)
    loss = (y - y_hat) ** 2

    # Backward pass (chain rule)
    dL_dyhat = 2 * (y_hat - y)   # dL/d(y_hat)
    dyhat_dz = sigmoid_deriv(z)  # d(y_hat)/dz
    dz_dw = x                    # dz/dw
    dz_db = 1.0                  # dz/db
    dL_dw = dL_dyhat * dyhat_dz * dz_dw  # Full chain rule
    dL_db = dL_dyhat * dyhat_dz * dz_db

    # Update weights
    w -= lr * dL_dw
    b -= lr * dL_db

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={loss:.6f}, w={w:.4f}, b={b:.4f}")
```
Computational Graphs: Visualizing Forward and Backward
A computational graph represents a function as a directed graph of elementary operations. Each node performs a simple operation (addition, multiplication, activation), and during the backward pass the gradient flows through the graph in reverse order thanks to the chain rule.
Consider $L = (\sigma(w_1 x_1 + w_2 x_2 + b) - y)^2$:
- Forward: $z_1 = w_1 x_1$, $z_2 = w_2 x_2$, $s = z_1 + z_2 + b$, $a = \sigma(s)$, $L = (a - y)^2$
- Backward: compute $\frac{\partial L}{\partial a}$, then $\frac{\partial L}{\partial s}$, then $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial w_2}$
This is exactly what PyTorch and TensorFlow do automatically with automatic differentiation.
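The two-input graph above can be traced node by node in plain NumPy. A minimal sketch (the concrete values of the weights, inputs, and target are illustrative choices), mirroring the forward and backward steps listed above:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative values for the graph L = (sigmoid(w1*x1 + w2*x2 + b) - y)^2
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.0, 2.0, 1.0

# Forward pass: one node at a time
z1 = w1 * x1
z2 = w2 * x2
s = z1 + z2 + b
a = sigmoid(s)
L = (a - y) ** 2

# Backward pass: the gradient flows through the graph in reverse
dL_da = 2 * (a - y)      # loss node
da_ds = a * (1 - a)      # sigmoid node
dL_ds = dL_da * da_ds
dL_dw1 = dL_ds * x1      # multiply node: d(w1*x1)/dw1 = x1
dL_dw2 = dL_ds * x2
dL_db = dL_ds * 1.0      # add node passes the gradient through unchanged

print(dL_dw1, dL_dw2, dL_db)  # each ≈ -0.25, -0.5, -0.25 for these values
```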
Jacobian and Hessian
The Jacobian generalizes the gradient to vector-valued functions. If $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is an $m \times n$ matrix:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The Hessian is the matrix of second derivatives, giving us information about the curvature of the loss function:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
The eigenvalues of the Hessian determine whether a critical point is a minimum (all positive), maximum (all negative), or saddle point (mixed). In neural network optimization, saddle points are much more common than local minima.
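The eigenvalue test is straightforward to check numerically. A minimal sketch (the two quadratic test functions are illustrative choices): at the origin, $f(x, y) = x^2 + y^2$ has a positive-definite Hessian (minimum), while $f(x, y) = x^2 - y^2$ has mixed-sign eigenvalues (saddle).

```python
import numpy as np

# Hessians at the origin of two quadratic test functions
H_min = np.array([[2.0, 0.0], [0.0, 2.0]])      # f = x^2 + y^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # f = x^2 - y^2

for label, H in [("x^2 + y^2", H_min), ("x^2 - y^2", H_saddle)]:
    eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
    if np.all(eig > 0):
        kind = "minimum"
    elif np.all(eig < 0):
        kind = "maximum"
    else:
        kind = "saddle point"
    print(label, eig, "->", kind)
```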
Full Backpropagation: 2-Layer Network
```python
import numpy as np

np.random.seed(42)

# XOR dataset (not linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Weight initialization
W1 = np.random.randn(2, 4) * 0.5  # (2 inputs, 4 hidden)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.5  # (4 hidden, 1 output)
b2 = np.zeros((1, 1))

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

lr = 1.0
for epoch in range(10000):
    # === FORWARD PASS ===
    z1 = X @ W1 + b1   # (4, 2) @ (2, 4) = (4, 4)
    a1 = sigmoid(z1)   # Hidden activation
    z2 = a1 @ W2 + b2  # (4, 4) @ (4, 1) = (4, 1)
    a2 = sigmoid(z2)   # Output

    # Loss: MSE
    loss = np.mean((y - a2) ** 2)

    # === BACKWARD PASS (Chain Rule) ===
    m = X.shape[0]
    # Output layer gradient
    dL_da2 = 2 * (a2 - y) / m
    da2_dz2 = a2 * (1 - a2)      # Sigmoid derivative
    dz2 = dL_da2 * da2_dz2       # (4, 1)
    dW2 = a1.T @ dz2             # (4, 4).T @ (4, 1) = (4, 1)
    db2 = np.sum(dz2, axis=0, keepdims=True)

    # Hidden layer gradient (chain rule continues!)
    da1 = dz2 @ W2.T             # (4, 1) @ (1, 4) = (4, 4)
    dz1 = da1 * (a1 * (1 - a1))  # Sigmoid derivative
    dW1 = X.T @ dz1              # (2, 4).T @ (4, 4) = (2, 4)
    db1 = np.sum(dz1, axis=0, keepdims=True)

    # === WEIGHT UPDATE ===
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.6f}")

# Final result
predictions = np.round(a2, 2)
print(f"\nFinal predictions:\n{predictions.flatten()}")
print(f"Targets: {y.flatten()}")
```
Gradient Checking: Verifying Gradients
To ensure backpropagation is correctly implemented, we can compare analytical gradients with numerical ones computed via finite differences:

$$\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta_i + \epsilon) - L(\theta_i - \epsilon)}{2\epsilon}$$

with $\epsilon \approx 10^{-7}$. The relative difference between analytical and numerical gradients should be less than $10^{-5}$.
```python
import numpy as np

def numerical_gradient(f, param, idx, epsilon=1e-7):
    """Compute the numerical gradient of f with respect to param[idx]
    via central differences, for verification."""
    original = param[idx]
    param[idx] = original + epsilon  # perturb in place so f sees the change
    loss_plus = f()
    param[idx] = original - epsilon
    loss_minus = f()
    param[idx] = original            # restore the original value
    return (loss_plus - loss_minus) / (2 * epsilon)

# Simple example: f = (w*x - y)^2
w = np.array([0.5])
x, y_true = 2.0, 3.0

def compute_loss():
    return (w[0] * x - y_true) ** 2

# Analytical gradient: dL/dw = 2*(w*x - y)*x
grad_analytical = 2 * (w[0] * x - y_true) * x

# Numerical gradient
grad_numerical = numerical_gradient(compute_loss, w, 0)

print(f"Analytical: {grad_analytical:.8f}")
print(f"Numerical: {grad_numerical:.8f}")
print(f"Rel diff: {abs(grad_analytical - grad_numerical) / max(abs(grad_analytical), 1e-8):.2e}")
```
Summary and Connections to ML
Key Takeaways
- Derivative: measures rate of change, indicates the function slope
- Gradient $\nabla L$: points in the direction of steepest ascent of the loss
- Gradient descent: $\theta \leftarrow \theta - \eta \nabla L$ (move opposite to the gradient)
- Chain rule: allows computing gradients through function compositions
- Backpropagation: applying the chain rule to the network's computational graph
- Vanishing gradient: the sigmoid derivative is at most 0.25, so gradients shrink layer by layer; ReLU (derivative 1 for x > 0) avoids this
In the Next Article: we will explore probability and statistics for ML. We will cover Bayes' theorem, distributions, Maximum Likelihood Estimation, and how to quantify uncertainty in predictions.