Introduction: Why Probability Is Fundamental for AI
Machine learning is, at its core, an exercise in reasoning under uncertainty. Data is noisy, models are approximations, and predictions are never certain. Probability gives us the mathematical framework to quantify this uncertainty and make informed decisions.
In this article, we will work our way from the basics (conditional probability, distributions) up to advanced concepts: Bayes' theorem, Maximum Likelihood Estimation, and the comparison between frequentist and Bayesian approaches.
What You Will Learn
- Conditional probability and independence
- Distributions: Gaussian, Bernoulli, Categorical, Poisson
- Bayes' theorem: updating beliefs with data
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori (MAP) and the Bayesian approach
- Central Limit Theorem and its implications
Foundations: Probability and Random Variables
The probability of an event A is a number between 0 and 1 that measures how likely that event is: P(A) \\in [0, 1].
The conditional probability of A given B measures the probability of A knowing that B occurred:
P(A | B) = \\frac{P(A \\cap B)}{P(B)}, \\quad P(B) > 0
Two events are independent if P(A \\cap B) = P(A) \\cdot P(B), meaning knowing one occurred does not change the probability of the other.
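These two definitions are easy to check by simulation. A minimal sketch, assuming a fair six-sided die, with A = "the roll is even" and B = "the roll is greater than 3" (events chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

A = rolls % 2 == 0   # event A: roll is even
B = rolls > 3        # event B: roll is greater than 3

p_a = A.mean()
p_b = B.mean()
p_ab = (A & B).mean()

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_ab / p_b
print(f"P(A) = {p_a:.3f}, P(A|B) = {p_a_given_b:.3f}")

# Independence check: compare P(A and B) with P(A) * P(B)
print(f"P(A and B) = {p_ab:.3f}, P(A)P(B) = {p_a * p_b:.3f}")
```

Here knowing B raises the probability of A (two of the three rolls above 3 are even), so P(A and B) ≈ 1/3 differs from P(A)P(B) ≈ 1/4: the events are dependent.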
Expected Value and Variance
The expected value (mean) of a random variable X:
\\mathbb{E}[X] = \\sum_x x \\, P(X = x) \\ \\text{(discrete)} \\qquad \\mathbb{E}[X] = \\int x \\, p(x) \\, dx \\ \\text{(continuous)}
The variance measures the spread around the mean:
\\text{Var}(X) = \\mathbb{E}[(X - \\mathbb{E}[X])^2] = \\mathbb{E}[X^2] - \\mathbb{E}[X]^2
The standard deviation \\sigma = \\sqrt{\\text{Var}(X)} has the same units as X, making it more interpretable.
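Both definitions of the variance can be verified numerically. A quick sketch, assuming samples from a Gaussian with known mean 5 and standard deviation 2:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)

# E[X]: the sample mean approximates the expected value
mean = x.mean()

# Var(X) two ways: E[(X - E[X])^2] and E[X^2] - E[X]^2
var_def = np.mean((x - mean) ** 2)
var_alt = np.mean(x ** 2) - mean ** 2

std = np.sqrt(var_def)  # same units as X, hence more interpretable
print(f"mean ~ {mean:.3f} (true 5), var ~ {var_def:.3f} (true 4), std ~ {std:.3f} (true 2)")
```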
Fundamental Distributions for ML
Bernoulli Distribution
Models a single experiment with two outcomes (success/failure). Parameter: p (probability of success): P(X = 1) = p, P(X = 0) = 1 - p, with mean p and variance p(1 - p).
In ML: models binary classification. The output of a neuron with sigmoid is the parameter p of a Bernoulli.
Gaussian (Normal) Distribution
The most important distribution in statistics and ML, parameterized by mean \\mu and variance \\sigma^2:
p(x) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left( -\\frac{(x - \\mu)^2}{2\\sigma^2} \\right)
Why it appears everywhere: by the Central Limit Theorem, the sum of many independent random variables tends to a Gaussian, regardless of the original distribution.
In ML: weight initialization (Gaussian with \\mu = 0), data noise, Gaussian Mixture Models, VAE (variational autoencoders).
Categorical (Multinoulli) Distribution
Generalization of Bernoulli to K classes with probabilities p_1, p_2, \\ldots, p_K where \\sum_i p_i = 1:
P(X = k) = p_k
In ML: the output of softmax models a categorical distribution over K classes.
import numpy as np
from scipy import stats
# Bernoulli: biased coin flip (p=0.7)
bernoulli = stats.bernoulli(p=0.7)
samples = bernoulli.rvs(size=1000)
print(f"Bernoulli - Empirical mean: {samples.mean():.3f} (expected: 0.7)")
# Gaussian: heights (mean=170cm, std=10cm)
gaussian = stats.norm(loc=170, scale=10)
heights = gaussian.rvs(size=1000)
print(f"Gaussian - Mean: {heights.mean():.1f}, Std: {heights.std():.1f}")
# Probability of being between 160 and 180
prob = gaussian.cdf(180) - gaussian.cdf(160)
print(f"P(160 < X < 180) = {prob:.4f}")
# Categorical: 6-sided die
probs = np.array([1/6] * 6)
categorical_samples = np.random.choice(6, size=1000, p=probs) + 1
print(f"Die - Mean: {categorical_samples.mean():.2f} (expected: 3.5)")
Bayes' Theorem: Updating Beliefs
Bayes' theorem is one of the most powerful tools for probabilistic reasoning. It allows us to update our belief about the probability of a hypothesis after observing data:
P(\\theta | D) = \\frac{P(D | \\theta) \\, P(\\theta)}{P(D)}
where:
- P(\\theta | D) - Posterior: updated belief after data
- P(D | \\theta) - Likelihood: how probable the data is given the model
- P(\\theta) - Prior: initial belief before data
- P(D) - Evidence: marginal probability of data (normalization constant)
Intuition: Bayes tells us to start with an initial belief (prior), observe data (likelihood), and combine the two to obtain an updated belief (posterior). The more data we observe, the more the posterior is dominated by the likelihood and less by the prior.
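This prior-to-posterior flow can be watched directly in the simplest conjugate setting: estimating a coin's bias with a Beta prior, whose posterior after Bernoulli observations is again a Beta. A minimal sketch, assuming a simulated coin with true bias 0.7 and an illustrative Beta(2, 2) prior centered on 0.5:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_p = 0.7
flips = rng.random(500) < true_p  # simulated coin flips (True = heads)

# Beta prior is conjugate to Bernoulli: posterior is Beta(alpha + heads, beta + tails)
alpha, beta = 2.0, 2.0  # mild prior belief centered on p = 0.5
for n in [0, 10, 100, 500]:
    heads = flips[:n].sum()
    post = stats.beta(alpha + heads, beta + (n - heads))
    print(f"n = {n:3d}: posterior mean = {post.mean():.3f}")
```

As n grows, the posterior mean drifts from the prior's 0.5 toward the true bias 0.7: exactly the "likelihood dominates the prior" behavior described above.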
Practical Example: Naive Bayes Classifier
import numpy as np
# Dataset: email spam detection
# Features: count of words "free", "offer", "hello"
X_train = np.array([
    [3, 2, 0],  # spam
    [4, 3, 1],  # spam
    [0, 0, 3],  # not-spam
    [1, 0, 4],  # not-spam
    [5, 4, 0],  # spam
    [0, 1, 2],  # not-spam
])
y_train = np.array([1, 1, 0, 0, 1, 0]) # 1=spam, 0=not-spam
# Naive Bayes: P(spam|features) proportional to P(features|spam) * P(spam)
# Compute prior
n_spam = np.sum(y_train == 1)
n_ham = np.sum(y_train == 0)
p_spam = n_spam / len(y_train)
p_ham = n_ham / len(y_train)
print(f"P(spam) = {p_spam:.3f}, P(ham) = {p_ham:.3f}")
# Compute mean and variance per feature (Gaussian likelihood)
spam_mean = X_train[y_train == 1].mean(axis=0)
spam_var = X_train[y_train == 1].var(axis=0) + 1e-6
ham_mean = X_train[y_train == 0].mean(axis=0)
ham_var = X_train[y_train == 0].var(axis=0) + 1e-6
def gaussian_log_likelihood(x, mean, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean)**2 / var)
# Classify new email
x_new = np.array([4, 3, 0])
log_p_spam = np.log(p_spam) + gaussian_log_likelihood(x_new, spam_mean, spam_var)
log_p_ham = np.log(p_ham) + gaussian_log_likelihood(x_new, ham_mean, ham_var)
print(f"Log P(spam|x) proportional to: {log_p_spam:.4f}")
print(f"Log P(ham|x) proportional to: {log_p_ham:.4f}")
print(f"Classification: {'SPAM' if log_p_spam > log_p_ham else 'HAM'}")
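The two log scores above are only proportional to the true posteriors; to recover normalized probabilities you can subtract their log-sum-exp before exponentiating, which is numerically stable. A small sketch with illustrative score values (not the ones computed above):

```python
import numpy as np

# Unnormalized log-posteriors for two classes (illustrative values)
log_scores = np.array([-12.3, -18.9])  # e.g. [spam score, ham score]

# Log-sum-exp trick: subtract the max before exponentiating for stability
m = log_scores.max()
log_norm = m + np.log(np.sum(np.exp(log_scores - m)))
posteriors = np.exp(log_scores - log_norm)

print(f"P(spam|x) ~ {posteriors[0]:.4f}, P(ham|x) ~ {posteriors[1]:.4f}")
```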
Maximum Likelihood Estimation (MLE)
MLE finds the model parameters that make the observed data most probable. Given a series of independent observations D = \\{x_1, \\ldots, x_n\\}, the likelihood is:
L(\\theta) = P(D | \\theta) = \\prod_{i=1}^{n} P(x_i | \\theta)
In practice, we work with the log-likelihood (it turns products into sums):
\\ell(\\theta) = \\log L(\\theta) = \\sum_{i=1}^{n} \\log P(x_i | \\theta)
To find the maximum, we compute the derivative and set it to zero: \\frac{d\\ell}{d\\theta} = 0.
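For a Bernoulli model this derivation has a closed form: setting \\frac{d\\ell}{dp} = 0 gives \\hat{p} equal to the sample mean. A quick numerical check, assuming simulated Bernoulli(0.3) data and a simple grid search over the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = (rng.random(1000) < 0.3).astype(float)  # Bernoulli(0.3) samples

# Closed form from setting the derivative to zero: p_hat = sample mean
p_closed = x.mean()

# Numerical check: maximize l(p) = sum(x log p + (1-x) log(1-p)) on a grid
grid = np.linspace(0.01, 0.99, 981)
ll = np.array([np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) for p in grid])
p_grid = grid[ll.argmax()]

print(f"closed form: {p_closed:.3f}, grid search: {p_grid:.3f}")
```

The grid maximizer lands on the grid point nearest the sample mean, confirming the calculus.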
Example: MLE for a Gaussian
For Gaussian data, the MLE parameter estimates are:
\\hat{\\mu} = \\frac{1}{n} \\sum_{i=1}^{n} x_i, \\qquad \\hat{\\sigma}^2 = \\frac{1}{n} \\sum_{i=1}^{n} (x_i - \\hat{\\mu})^2
The sample mean and variance are the MLE estimates. The connection to ML is deep: minimizing cross-entropy loss is equivalent to maximizing the log-likelihood.
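This equivalence is not just conceptual; for binary labels the two quantities are identical line by line. A minimal sketch with illustrative labels and predicted probabilities (e.g. sigmoid outputs):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # binary labels
p = np.array([0.9, 0.2, 0.8, 0.6, 0.1])  # predicted P(y=1), e.g. sigmoid outputs

# Binary cross-entropy loss (mean over samples)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Negative log-likelihood of the labels under Bernoulli(p), per sample
nll = -np.mean(np.log(np.where(y == 1, p, 1 - p)))

print(f"cross-entropy = {bce:.6f}, NLL = {nll:.6f}")
```

The two numbers agree exactly: minimizing cross-entropy is maximizing the Bernoulli log-likelihood.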
Maximum A Posteriori (MAP)
MAP adds a prior to the parameters, combining likelihood and prior:
\\hat{\\theta}_{\\text{MAP}} = \\arg\\max_\\theta \\left[ \\log P(D | \\theta) + \\log P(\\theta) \\right]
With a Gaussian prior P(\\theta) \\sim \\mathcal{N}(0, \\sigma_p^2), the term \\log P(\\theta) becomes an L2 penalty on the weights: this is why L2 regularization (Ridge) is equivalent to a Gaussian prior on weights.
import numpy as np
from scipy.optimize import minimize_scalar
# Observed data (heights in cm)
data = np.array([168, 172, 175, 170, 173, 169, 171, 174, 176, 170])
# MLE: sample mean and variance
mu_mle = np.mean(data)
sigma2_mle = np.var(data)
print(f"MLE: mu={mu_mle:.2f}, sigma^2={sigma2_mle:.2f}")
# Log-likelihood function
def neg_log_likelihood(mu, data=data, sigma2=sigma2_mle):
    n = len(data)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((data - mu)**2) / (2 * sigma2)
# MAP with Gaussian prior: mu ~ N(170, 5^2)
prior_mu = 170
prior_sigma2 = 25
def neg_log_posterior(mu):
    nll = neg_log_likelihood(mu)
    neg_log_prior = (mu - prior_mu)**2 / (2 * prior_sigma2)
    return nll + neg_log_prior
result_mle = minimize_scalar(neg_log_likelihood, bounds=(150, 190), method='bounded')
result_map = minimize_scalar(neg_log_posterior, bounds=(150, 190), method='bounded')
print(f"MLE mu: {result_mle.x:.4f}")
print(f"MAP mu: {result_map.x:.4f} (shrunk toward prior {prior_mu})")
Central Limit Theorem
The Central Limit Theorem (CLT) states that the sum (or average) of many independent random variables, regardless of their original distribution, tends to a Gaussian distribution:
\\frac{\\bar{X}_n - \\mu}{\\sigma / \\sqrt{n}} \\xrightarrow{d} \\mathcal{N}(0, 1) \\quad \\text{as } n \\to \\infty
This explains why the Gaussian appears everywhere: a trained neural network weight is the sum of many small gradient updates, sensor noise is the sum of many small perturbations, and so on.
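The CLT is easy to see empirically. A sketch, assuming averages of n = 50 uniform variables (a decidedly non-Gaussian starting distribution with \\mu = 0.5 and \\sigma^2 = 1/12):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
# 10,000 independent averages of n uniform(0, 1) variables
means = rng.random((10_000, n)).mean(axis=1)

# Standardize: (X_bar - mu) / (sigma / sqrt(n)) with mu = 0.5, sigma = sqrt(1/12)
z = (means - 0.5) / (np.sqrt(1 / 12) / np.sqrt(n))
print(f"mean ~ {z.mean():.3f} (expect 0), std ~ {z.std():.3f} (expect 1)")

# Normality check: skewness and excess kurtosis should both be near 0
print(f"skew ~ {stats.skew(z):.3f}, kurtosis ~ {stats.kurtosis(z):.3f}")
```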
Summary and Connections to ML
Key Takeaways
- Bayes: P(\\theta|D) \\propto P(D|\\theta) P(\\theta) - updates beliefs with data
- Gaussian: the most common distribution, appears everywhere thanks to CLT
- MLE: finds parameters that maximize the probability of observed data
- MAP: MLE + prior, equivalent to regularization
- Cross-entropy loss = negative log-likelihood: the fundamental connection
- L2 regularization = Gaussian prior: the probabilistic connection
In the Next Article: we will explore optimization for ML. We will cover Gradient Descent, SGD, Adam, momentum, and learning rate scheduling strategies that determine whether a model converges or diverges.