Introduction: Measuring Information
Information theory, founded by Claude Shannon in 1948, gives us the tools to quantify uncertainty, measure the amount of information in a message, and evaluate how well a model approximates reality. In machine learning, these concepts appear everywhere: cross-entropy is the default classification loss, and KL divergence is at the heart of VAEs and knowledge distillation.
What You Will Learn
- Information content: surprise as -log(p)
- Entropy: measuring the uncertainty of a distribution
- Cross-entropy: the most used classification loss
- KL divergence: an asymmetric measure of the difference between distributions
- Mutual information: dependency between variables
- Perplexity and its connections to language models
Information Content: Surprise
The information content (or self-information) of an event with probability p measures how "surprising" that event is:

I(x) = -\log_2 p(x)

Intuition: a very probable event (p \approx 1) carries little information (low surprise). A rare event (p \approx 0) carries much information (high surprise). In base 2 the unit is the bit: one bit is the information content of a fair coin flip.
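A quick way to build intuition is to compute self-information directly; this minimal sketch only assumes NumPy:

```python
import numpy as np

def self_information(p):
    """Information content -log2(p), in bits, of an event with probability p."""
    return -np.log2(p)

print(f"Fair coin flip: {self_information(0.5):.2f} bits")    # 1 bit
print(f"Rolling a 6:    {self_information(1/6):.2f} bits")
print(f"Rare event:     {self_information(0.001):.2f} bits")
```

Note how the rarer the event, the more bits of surprise it carries.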
Entropy: Average Uncertainty
Entropy is the expected value of the information content, i.e. the average surprise of a distribution:

H(X) = -\sum_x p(x) \log_2 p(x)

For a continuous distribution, the differential entropy is:

h(X) = -\int p(x) \log p(x) \, dx
Fundamental properties:
- H(X) \geq 0 always (uncertainty is never negative)
- H(X) = 0 only if X is deterministic (one event has probability 1)
- H(X) is maximal for the uniform distribution (maximum uncertainty)
Example: for a fair coin (P(H) = P(T) = 0.5), the entropy is H = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 1 bit. For a biased coin with P(H) = 0.99, the entropy is about 0.08 bits: almost no uncertainty, since we almost always know the outcome.
import numpy as np

def entropy(probs):
    """Compute entropy in bits (log base 2)."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

# Fair coin
print(f"Fair coin: H = {entropy([0.5, 0.5]):.4f} bits")

# Biased coin
print(f"Biased coin (0.99): H = {entropy([0.99, 0.01]):.4f} bits")

# Fair 6-sided die (uniform)
print(f"Fair die: H = {entropy([1/6]*6):.4f} bits")

# Loaded die (3 comes up 50%)
probs_loaded = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1]
print(f"Loaded die: H = {entropy(probs_loaded):.4f} bits")
Cross-Entropy: The Classification Loss
Cross-entropy between the true distribution p and the model's predicted distribution q measures how many bits are needed on average to encode data from p using the optimal code for q:

H(p, q) = -\sum_x p(x) \log q(x)

In classification, p is the target distribution (one-hot) and q is the softmax output. For a single sample with one-hot label y and prediction \hat{y}:

\text{CE} = -\sum_c y_c \log \hat{y}_c = -\log \hat{y}_{\text{true class}}

For binary classification, this simplifies to binary cross-entropy:

\text{BCE} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]
Fundamental connection: minimizing cross-entropy is equivalent to maximizing the log-likelihood of the model. This explains why cross-entropy is the natural classification loss: we are looking for the model that assigns the maximum probability to the observed data.
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) using natural logarithm."""
    q = np.clip(q, 1e-15, 1 - 1e-15)  # Avoid log(0)
    return -np.sum(p * np.log(q))

def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy for a single sample."""
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# 3-class classification
y_true = np.array([0, 0, 1]) # Class 3
# Good prediction
y_pred_good = np.array([0.05, 0.05, 0.90])
print(f"Good prediction: CE = {cross_entropy(y_true, y_pred_good):.4f}")
# Mediocre prediction
y_pred_mid = np.array([0.2, 0.3, 0.5])
print(f"Medium prediction: CE = {cross_entropy(y_true, y_pred_mid):.4f}")
# Wrong prediction
y_pred_bad = np.array([0.7, 0.2, 0.1])
print(f"Wrong prediction: CE = {cross_entropy(y_true, y_pred_bad):.4f}")
# Binary cross-entropy
print(f"\nBCE(y=1, pred=0.9) = {binary_cross_entropy(1, 0.9):.4f}")
print(f"BCE(y=1, pred=0.5) = {binary_cross_entropy(1, 0.5):.4f}")
print(f"BCE(y=1, pred=0.1) = {binary_cross_entropy(1, 0.1):.4f}")
KL Divergence: Distance Between Distributions
KL divergence (Kullback-Leibler) measures how much a distribution q differs from a reference distribution p:

D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
Important properties:
- D_{\text{KL}}(p \| q) \geq 0 always (Gibbs' inequality)
- D_{\text{KL}}(p \| q) = 0 if and only if p = q
- Not symmetric: D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p), so it is not a true distance
The relation H(p, q) = H(p) + D_{\text{KL}}(p \| q) tells us that cross-entropy is the entropy of p plus the KL divergence. Since H(p) is constant (it does not depend on the model), minimizing cross-entropy is equivalent to minimizing KL divergence.
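This identity is easy to verify numerically; the sketch below uses natural logarithms throughout (so all quantities are in nats) and assumes nothing beyond NumPy:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy_p = -np.sum(p * np.log(p))   # H(p)
cross_ent = -np.sum(p * np.log(q))   # H(p, q)
kl_pq = np.sum(p * np.log(p / q))    # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q), up to floating-point error
print(f"H(p, q)     = {cross_ent:.6f}")
print(f"H(p) + D_KL = {entropy_p + kl_pq:.6f}")
```

The two printed values agree, which is exactly why minimizing one is the same as minimizing the other when H(p) is fixed.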
KL Divergence in VAEs
In Variational Autoencoders, the loss includes a KL divergence term that forces the latent distribution N(\mu, \sigma^2) to be close to a standard Gaussian N(0, 1). For a diagonal Gaussian, this term has a closed form:

D_{\text{KL}}\left( N(\mu, \sigma^2) \,\|\, N(0, 1) \right) = -\frac{1}{2} \sum_j \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
import numpy as np

def kl_divergence(p, q):
    """KL divergence D_KL(p || q)."""
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
# Two distributions over 4 classes
p = np.array([0.25, 0.25, 0.25, 0.25]) # Uniform
q1 = np.array([0.3, 0.2, 0.3, 0.2]) # Slightly different
q2 = np.array([0.9, 0.03, 0.04, 0.03]) # Very different
print(f"KL(p || q1) = {kl_divergence(p, q1):.6f}")
print(f"KL(p || q2) = {kl_divergence(p, q2):.6f}")
# KL asymmetry
print(f"\nKL(p || q1) = {kl_divergence(p, q1):.6f}")
print(f"KL(q1 || p) = {kl_divergence(q1, p):.6f}")
# KL for VAE (Gaussian vs standard normal)
def kl_gaussian(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1)."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
mu = np.array([0.5, -0.3, 0.1])
log_var = np.array([-0.5, 0.2, -0.1])
print(f"\nKL(N(mu,sigma^2) || N(0,1)) = {kl_gaussian(mu, log_var):.4f}")
Mutual Information
Mutual information measures how much information one random variable provides about another:

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)} = H(X) - H(X \mid Y)
If I(X; Y) = 0, the variables are independent. In ML, mutual information is used for: feature selection (selecting the most informative features), clustering evaluation, and as an objective in the InfoNCE loss of contrastive learning.
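For discrete variables, mutual information can be computed directly from a joint probability table; this is a minimal sketch assuming only NumPy (results in nats):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X; Y) in nats from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0                        # Avoid log(0)
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

# Independent variables: joint = outer product of marginals -> I = 0
indep = np.outer([0.5, 0.5], [0.3, 0.7])
print(f"Independent: I = {mutual_information(indep):.6f}")

# Perfectly correlated binary variables -> I = H(X) = log(2) nats
corr = np.array([[0.5, 0.0], [0.0, 0.5]])
print(f"Correlated:  I = {mutual_information(corr):.6f}")
```

The two extremes bracket the general case: knowing Y tells us nothing about X in the first example and everything in the second.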
Perplexity: Evaluating Language Models
Perplexity is a standard metric for evaluating language models. It is defined as the exponential of the average cross-entropy per token:

\text{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_{<i}) \right)

A perplexity of k means that, on average, the model is as "confused" as if it were choosing uniformly among k options at each step. Lower perplexity means a better model.
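Given the probabilities a model assigned to each observed token, perplexity is a one-liner; this sketch assumes a list of per-token probabilities and uses natural logarithms:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to the observed tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    avg_nll = -np.mean(np.log(token_probs))  # average cross-entropy per token (nats)
    return np.exp(avg_nll)

# A model that assigns probability 0.25 to every token -> perplexity 4
print(f"Uniform over 4: PPL = {perplexity([0.25] * 10):.2f}")

# A more confident model -> lower perplexity
print(f"Confident:      PPL = {perplexity([0.9, 0.8, 0.7, 0.95]):.2f}")
```

The first case illustrates the "k options" intuition exactly: probability 1/4 per token gives perplexity 4.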
Summary and Connections to ML
Key Takeaways
- Entropy H(X): measures uncertainty, maximal for uniform distributions
- Cross-entropy H(p,q): the standard classification loss
- KL divergence: asymmetric measure of the difference between distributions, used in VAEs
- Minimizing cross-entropy = maximizing log-likelihood = minimizing KL
- Mutual information: measures dependency, used in feature selection and contrastive learning
- Perplexity: standard language model metric, lower is better
In the Next Article: we will explore PCA and dimensionality reduction. We will see how the covariance matrix, eigenvectors, and SVD allow compressing data while retaining most of the information.