Introduction: From Sample to Population
Inferential statistics allows us to draw conclusions about an entire population by observing only a sample. In ML, this is fundamental: we train on a training set (sample) and want the model to work on unseen data (population). Confidence intervals, hypothesis tests, and A/B testing are indispensable tools for every data scientist.
What You Will Learn
- Standard error and sampling distribution
- Confidence intervals: what they really mean
- Hypothesis testing: null hypothesis, p-value, Type I and Type II errors
- T-test and Chi-square test
- A/B testing: setup, power analysis, early stopping
- Effect size and why the p-value is not enough
Sampling Distribution and Standard Error
The sampling distribution of the mean describes how the sample mean varies if we repeat the experiment many times. By the CLT, for large n the sample mean is approximately normal:

\\bar{x} \\sim N(\\mu, \\sigma^2 / n)
The standard error (SE) is the standard deviation of the sampling distribution:

\\text{SE} = s / \\sqrt{n}

where s is the sample standard deviation. SE decreases as 1/\\sqrt{n}: to halve the uncertainty you need 4 times more data.
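A quick simulation illustrates the square-root law (synthetic standard-normal data; the sample sizes are arbitrary): quadrupling n roughly halves the empirical standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_se(n, n_reps=20_000):
    """Std of the sample mean over many repeated samples of size n."""
    means = rng.normal(loc=0.0, scale=1.0, size=(n_reps, n)).mean(axis=1)
    return means.std()

se_25 = empirical_se(25)    # theory: 1 / sqrt(25)  = 0.20
se_100 = empirical_se(100)  # theory: 1 / sqrt(100) = 0.10

print(f"SE(n=25):  {se_25:.3f}")
print(f"SE(n=100): {se_100:.3f}")
print(f"ratio: {se_25 / se_100:.2f}")  # close to 2
```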
Confidence Intervals
A 95% confidence interval for the mean is:

\\bar{x} \\pm 1.96 \\cdot \\text{SE}

With small samples (n < 30), Student's t-distribution is used instead of the normal:

\\bar{x} \\pm t_{0.975,\\,n-1} \\cdot \\text{SE}
What it really means: a 95% CI does NOT mean "there is a 95% probability the true value is in the interval." It means: if we repeated the experiment infinitely many times, 95% of the calculated intervals would contain the true value. The difference is subtle but crucial.
import numpy as np
from scipy import stats
# Sample of accuracies from 10 experiments
accuracies = np.array([0.92, 0.89, 0.91, 0.93, 0.90, 0.88, 0.91, 0.94, 0.90, 0.92])
n = len(accuracies)
mean = np.mean(accuracies)
se = stats.sem(accuracies) # Standard Error
# 95% CI with t-distribution
t_critical = stats.t.ppf(0.975, df=n-1)
ci_lower = mean - t_critical * se
ci_upper = mean + t_critical * se
print(f"Mean: {mean:.4f}")
print(f"SE: {se:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
# Quick method with scipy
ci = stats.t.interval(0.95, df=n-1, loc=mean, scale=se)
print(f"95% CI (scipy): [{ci[0]:.4f}, {ci[1]:.4f}]")
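The frequentist interpretation can be checked by simulation (synthetic normal data with an arbitrary true mean of 0.90): build many 95% intervals from independent samples and count how often they cover the true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, true_std, n = 0.90, 0.05, 10
n_experiments = 10_000

t_crit = stats.t.ppf(0.975, df=n - 1)
covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, true_std, n)
    m, se = sample.mean(), stats.sem(sample)
    if m - t_crit * se <= true_mean <= m + t_crit * se:
        covered += 1

coverage = covered / n_experiments
print(f"Empirical coverage: {coverage:.3f}")  # close to 0.95
```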
Hypothesis Testing
A hypothesis test evaluates whether observed data is compatible with a hypothesis:
- H_0 (null hypothesis): no effect (e.g., two models have the same accuracy)
- H_1 (alternative hypothesis): there is an effect
Test Statistic and P-Value
To compare a sample mean with a known value \\mu_0, the t-statistic is computed:

t = (\\bar{x} - \\mu_0) / (s / \\sqrt{n})

The p-value is the probability of observing a result this extreme (or more extreme) if H_0 were true. If p < \\alpha (typically 0.05), we reject H_0.
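As a sketch, `scipy.stats.ttest_1samp` computes exactly this t-statistic and p-value; here we reuse the accuracy sample from the CI example and test against a hypothetical baseline of 0.90:

```python
import numpy as np
from scipy import stats

accuracies = np.array([0.92, 0.89, 0.91, 0.93, 0.90,
                       0.88, 0.91, 0.94, 0.90, 0.92])

# H_0: the true mean accuracy equals 0.90
t_stat, p_value = stats.ttest_1samp(accuracies, popmean=0.90)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```

With these ten values the p-value lands above 0.05, so the apparent improvement over 0.90 is not significant at this sample size.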
Type I and Type II Errors
- Type I error (false positive): rejecting H_0 when it is true. Its probability is the significance level \\alpha.
- Type II error (false negative): failing to reject H_0 when it is false. Its probability is \\beta.
The power of a test is 1 - \\beta: the probability of detecting a real effect when it exists.
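Both rates can be estimated by Monte Carlo (synthetic normal data; the group size and effect size are illustrative): under H_0 the rejection rate should be close to \\alpha, while under a real effect it estimates the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sims = 30, 0.05, 5_000

def rejection_rate(true_diff):
    """Fraction of simulated two-sample t-tests with p < alpha."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        rejections += p < alpha
    return rejections / n_sims

type1 = rejection_rate(0.0)   # no real effect
power = rejection_rate(0.8)   # real effect, d = 0.8
print(f"Type I error rate: {type1:.3f}")  # close to alpha
print(f"Power (d = 0.8):   {power:.3f}")
```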
Two-Sample T-Test
To compare two models, we use the independent two-sample t-test:
import numpy as np
from scipy import stats
# Model A vs Model B: accuracies over 15 runs
model_a = np.array([0.92, 0.89, 0.91, 0.93, 0.90, 0.88, 0.91, 0.94,
0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.91])
model_b = np.array([0.94, 0.93, 0.95, 0.92, 0.94, 0.91, 0.93, 0.95,
0.93, 0.94, 0.92, 0.93, 0.94, 0.93, 0.94])
# Two-sample t-test
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"Model A: mean={model_a.mean():.4f}, std={model_a.std():.4f}")
print(f"Model B: mean={model_b.mean():.4f}, std={model_b.std():.4f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Significant (alpha=0.05): {p_value < 0.05}")
Effect Size: Beyond the P-Value
The p-value tells us whether an effect is statistically significant, but not how large it is. Effect size (Cohen's d) measures the magnitude:

d = (\\bar{x}_1 - \\bar{x}_2) / s_{\\text{pooled}}

Interpretation: d \\approx 0.2 small, d \\approx 0.5 medium, d \\approx 0.8 large.
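A minimal sketch computing Cohen's d for the two models from the t-test above (same accuracy arrays; the pooled standard deviation formula assumes equal group sizes):

```python
import numpy as np

model_a = np.array([0.92, 0.89, 0.91, 0.93, 0.90, 0.88, 0.91, 0.94,
                    0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.91])
model_b = np.array([0.94, 0.93, 0.95, 0.92, 0.94, 0.91, 0.93, 0.95,
                    0.93, 0.94, 0.92, 0.93, 0.94, 0.93, 0.94])

# Pooled standard deviation (equal n, sample variances with ddof=1)
s_pooled = np.sqrt((model_a.var(ddof=1) + model_b.var(ddof=1)) / 2)
cohens_d = (model_b.mean() - model_a.mean()) / s_pooled
print(f"Cohen's d: {cohens_d:.2f}")
```

Here the difference is small in absolute terms (about 2.4 points of accuracy) but large relative to the run-to-run variability, so the effect size is large.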
A/B Testing for ML
A/B testing compares two variants (A = control, B = treatment) to determine which performs better. The setup requires:
- Define the metric: click-through rate, conversion, accuracy
- Calculate the required sample size (power analysis)
- Randomize users into groups
- Collect data for the predetermined duration
- Analyze with the appropriate statistical test
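When the metric is a rate (e.g. conversion or click-through rate), a common choice for the final step is the two-proportion z-test with a pooled proportion. A sketch with made-up counts:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: conversions / users per variant
conv_a, n_a = 200, 5000   # control:   4.0% conversion
conv_b, n_b = 260, 5000   # treatment: 5.2% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided

print(f"Conversion A: {p_a:.3f}, B: {p_b:.3f}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```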
Power Analysis: How Many Samples Are Needed?
Power analysis calculates the required sample size to detect an effect of size d with significance level \\alpha and power 1 - \\beta. For a two-sided two-sample test, per group:

n = 2 (z_{1-\\alpha/2} + z_{1-\\beta})^2 / d^2
import numpy as np
from scipy import stats
def power_analysis(effect_size, alpha=0.05, power=0.8):
    """Required sample size per group for a two-sided two-sample test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for alpha
    z_beta = stats.norm.ppf(power)           # critical value for power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))
# Scenario: detect a 2% accuracy improvement
# Base accuracy: 90%, target: 92%, estimated std: 5%
effect_size = 0.02 / 0.05 # Cohen's d = 0.4
n_per_group = power_analysis(effect_size)
print(f"Effect size (Cohen's d): {effect_size:.2f}")
print(f"Sample size per group: {n_per_group}")
# Simulated A/B test
np.random.seed(42)
n = n_per_group
group_a = np.random.normal(0.90, 0.05, n) # Control
group_b = np.random.normal(0.92, 0.05, n) # Treatment
t_stat, p_value = stats.ttest_ind(group_a, group_b)
diff = group_b.mean() - group_a.mean()
s_pooled = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = diff / s_pooled
print(f"\nA/B Test Results:")
print(f" Group A: {group_a.mean():.4f}")
print(f" Group B: {group_b.mean():.4f}")
print(f" Difference: {diff:.4f}")
print(f" p-value: {p_value:.6f}")
print(f" Cohen's d: {cohens_d:.4f}")
print(f" Significant: {p_value < 0.05}")
Multiple Testing Correction
When running many tests simultaneously (e.g., comparing 10 models), the probability of at least one false positive increases. The Bonferroni correction divides the significance level by the number of tests:

\\alpha' = \\alpha / m

where m is the number of tests. It is conservative; for a less strict approach, the Benjamini-Hochberg procedure controls the False Discovery Rate (FDR).
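Both corrections can be sketched in a few lines (the p-values are illustrative; Benjamini-Hochberg is implemented directly rather than via a library):

```python
import numpy as np

p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.045, 0.120, 0.300, 0.520])
m, alpha = len(p_values), 0.05

# Bonferroni: compare each p-value against alpha / m
bonferroni = p_values < alpha / m

# Benjamini-Hochberg: largest k with p_(k) <= (k/m) * alpha,
# then reject the k smallest p-values
order = np.argsort(p_values)
sorted_p = p_values[order]
thresholds = (np.arange(1, m + 1) / m) * alpha
below = np.nonzero(sorted_p <= thresholds)[0]
bh = np.zeros(m, dtype=bool)
if below.size > 0:
    k = below[-1]              # last sorted index passing its threshold
    bh[order[: k + 1]] = True  # reject it and all smaller p-values

print(f"Bonferroni rejects: {bonferroni.sum()} of {m}")
print(f"Benjamini-Hochberg rejects: {bh.sum()} of {m}")
```

As expected, Benjamini-Hochberg rejects at least as many hypotheses as the more conservative Bonferroni correction.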
Summary
Key Takeaways
- Standard Error: \\text{SE} = s / \\sqrt{n} - uncertainty decreases with more data
- 95% CI: \\bar{x} \\pm 1.96 \\cdot \\text{SE} - it is a frequency, not a probability
- P-value: probability of data this extreme if H_0 is true
- Effect size (Cohen's d): measures how large the effect is, not just significance
- Power analysis: calculate how many samples are needed before collecting data
- Bonferroni: correct \\alpha for multiple tests
In the Next Article: we will explore the mathematics of Transformers. Self-attention, scaled dot-product, multi-head attention, positional encoding: the formulas that revolutionized NLP and AI.