A/B Testing ML Models: Methodology, Metrics, and Implementation
You have trained two versions of your recommendation model. The new transformer-based model shows a 3% higher AUC on the holdout set. A clear improvement, right? But does this difference actually translate into a positive impact for real users? The model might perform better on certain demographic cohorts and worse on others. It might reduce the click-through rate while increasing long-term satisfaction. It might have higher latency that nullifies the accuracy gains. Offline metrics do not lie, but they only tell part of the story.
A/B testing for ML models is the methodology that answers these questions rigorously, comparing model versions on real traffic with real users and measuring the business metrics that actually matter. According to 2025 research from Aimpoint Digital Labs, organizations that adopt structured A/B testing strategies for ML models reduce production regression risk by 40% compared to direct deployments based solely on offline metrics. The MLOps market, valued at $4.38 billion in 2026 and projected to reach $89.18 billion by 2035, has A/B testing as one of its fundamental building blocks.
In this guide we will build a complete A/B testing system for ML models: from statistical theory to a FastAPI router, from canary deployment to shadow mode, from frequentist tests to Bayesian A/B testing with Thompson Sampling, through to test monitoring with Prometheus and Grafana.
What You Will Learn
- Key differences between ML A/B testing and classic web A/B testing
- Experiment design: sample size, statistical power, success metrics
- Traffic splitting with a FastAPI router and progressive canary deployment
- Shadow mode: testing without user impact
- Multi-Armed Bandits and Thompson Sampling as alternatives to classic A/B testing
- Statistical analysis: p-value, confidence intervals, effect size
- Bayesian A/B testing for faster decision making
- Test monitoring with Prometheus and Grafana
- Best practices and anti-patterns to avoid
ML A/B Testing vs Web A/B Testing: Critical Differences
A/B testing was born in web analytics to compare landing page variants, buttons and copy. The basic statistical framework is the same, but A/B testing for ML models has additional complexities that make it substantially different in practice.
In web testing you compare discrete visual experiences: variant A and variant B are clearly separated. In ML models, predictions are continuous, distributed and often correlated over time. A recommendation model serving the same user in different sessions does not produce independent predictions: there is temporal correlation that violates the independence assumptions of classical statistical tests.
Key Differences: ML vs Web A/B Testing
- Metrics: web testing optimizes CTR or conversion rate; ML testing simultaneously optimizes offline metrics (AUC, RMSE) and business metrics (revenue, churn rate, NPS), which often conflict.
- Feedback latency: web results are immediate (click); ML results may take days or weeks (churn after 30 days, revenue after a quarter).
- Effect distribution: a model may perform better on average but worse on specific cohorts (age bias, geographic bias), requiring segmented analysis.
- System effects: in feedback loop systems (recommendations, dynamic pricing), model B influences the data that will train model C.
- Operational risks: a bug in a web variant causes a poor UX; a bug in a fraud detection ML model can cause significant financial losses.
Experiment Design: Before the Code
A poorly designed A/B test is worse than no A/B test: it provides a false sense of scientific rigor while producing wrong conclusions. Experiment design must precede any technical implementation.
Defining Success Metrics
Every experiment must have a single Primary Metric that determines the winner, plus zero to two Guardrail Metrics that model B must not degrade compared to A. The primary metric must be directly causally linked to the business objective.
Examples of metrics for different scenarios:
- Churn model: primary = 30-day retention rate; guardrails = P95 latency, campaign cost
- Recommendation model: primary = revenue per session; guardrails = CTR, recommendation diversity
- Fraud model: primary = undetected fraud rate; guardrails = false positive rate, latency
- Pricing model: primary = gross margin; guardrails = conversion rate, NPS
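These definitions are easy to let drift once a test is running; one option is to pin them down in code at design time. A minimal sketch (all names here, such as `ExperimentMetrics`, are illustrative, not part of any framework):

```python
# experiment_config.py
# Sketch: capture the primary metric and guardrails as an explicit,
# reviewable object. All names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class GuardrailMetric:
    name: str
    max_degradation: float  # e.g. 0.10 = tolerate at most a 10% relative regression

@dataclass(frozen=True)
class ExperimentMetrics:
    primary_metric: str
    guardrails: tuple = ()

    def validate(self):
        # Enforce the design rules above: one primary metric, at most two guardrails
        if not self.primary_metric:
            raise ValueError("A primary metric is required")
        if len(self.guardrails) > 2:
            raise ValueError("Use at most two guardrail metrics")

churn_experiment = ExperimentMetrics(
    primary_metric="retention_rate_30d",
    guardrails=(
        GuardrailMetric("latency_p95_ms", max_degradation=0.10),
        GuardrailMetric("campaign_cost", max_degradation=0.05),
    ),
)
churn_experiment.validate()
```

Freezing the dataclasses makes the experiment definition immutable for the duration of the test, which is exactly the discipline the design phase is meant to enforce.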
Sample Size Calculation
The required sample size depends on three factors: the minimum effect size you want to detect (minimum detectable effect, MDE), the significance level alpha (usually 0.05) and the statistical power 1-beta (usually 0.80).
# sample_size_calculator.py
# Sample size calculation for ML A/B testing
import math
from scipy.stats import norm
def calculate_sample_size(
baseline_rate: float,
minimum_detectable_effect: float,
alpha: float = 0.05,
power: float = 0.80,
two_tailed: bool = True
) -> int:
"""
Calculates sample size for an A/B test on proportions.
Args:
baseline_rate: Current rate for model A (e.g., 0.15 for 15% churn)
minimum_detectable_effect: Minimum relative change to detect (e.g., 0.05 for +5%)
alpha: Significance level (type I error rate)
power: Statistical power (1 - type II error rate)
two_tailed: True for two-tailed test (recommended default)
Returns:
Sample size for each of the two variants
"""
p1 = baseline_rate
p2 = baseline_rate * (1 + minimum_detectable_effect)
z_alpha = norm.ppf(1 - alpha / (2 if two_tailed else 1))
z_beta = norm.ppf(power)
p_avg = (p1 + p2) / 2
q_avg = 1 - p_avg
numerator = (
z_alpha * math.sqrt(2 * p_avg * q_avg)
+ z_beta * math.sqrt(p1 * (1-p1) + p2 * (1-p2))
) ** 2
denominator = (p2 - p1) ** 2
return math.ceil(numerator / denominator)
def calculate_duration_days(
sample_size_per_variant: int,
daily_requests: int,
traffic_split: float = 0.5
) -> float:
"""Estimates test duration in days."""
return sample_size_per_variant / (daily_requests * traffic_split)
# --- Practical example: churn model ---
baseline_churn_rate = 0.18 # 18% current churn (model A)
mde = 0.10 # detect a 10% relative improvement (18% to 16.2%)
n_per_variant = calculate_sample_size(
baseline_rate=baseline_churn_rate,
    minimum_detectable_effect=-mde,  # negative: an improvement means churn goes down
alpha=0.05,
power=0.80
)
daily_traffic = 5000
test_duration = calculate_duration_days(n_per_variant, daily_traffic, 0.5)
print(f"Sample size per variant: {n_per_variant:,} samples")
print(f"Estimated test duration: {test_duration:.1f} days")
print(f"Total traffic needed: {n_per_variant * 2:,} requests")
# Output (approximate):
# Sample size per variant: 6,868 samples
# Estimated test duration: 2.7 days
# Total traffic needed: 13,736 requests
The Peeking Problem: Do Not Look at Results Too Early
"Peeking" (or optional stopping) is one of the most common mistakes in A/B testing: checking interim results and stopping the test as soon as statistical significance appears. This dramatically inflates the false positive rate: if you check the data every day over a multi-week test, the probability of finding at least one significant result by chance alone can climb toward 30% even when the two variants are identical. Fix the sample size in advance and evaluate results only at the end of the test, or adopt a sequential testing method such as the Sequential Probability Ratio Test (SPRT), which is designed for repeated looks.
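For teams that genuinely need interim looks, the SPRT mentioned above can be sketched for a binary outcome as follows. This is a minimal illustration under Wald's classic threshold approximation, not a production sequential-testing library; the rates p0 and p1 must be fixed before the test starts.

```python
# sprt_sketch.py
# Minimal Sequential Probability Ratio Test for a Bernoulli outcome.
# H0: success rate = p0, H1: success rate = p1.
import math

def sprt_decision(successes: int, trials: int,
                  p0: float, p1: float,
                  alpha: float = 0.05, beta: float = 0.20) -> str:
    failures = trials - successes
    # Log-likelihood ratio of the observed data under H1 vs H0
    llr = (successes * math.log(p1 / p0)
           + failures * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this
    if llr >= upper:
        return "accept_h1"  # stop: evidence for the improved rate
    if llr <= lower:
        return "accept_h0"  # stop: evidence against the improvement
    return "continue"       # keep collecting data

# Example: testing whether retention improved from 82% to 84%
print(sprt_decision(successes=850, trials=1000, p0=0.82, p1=0.84))
# -> accept_h1
```

Unlike naive peeking, each interim check here is built into the test's error guarantees, so stopping early does not inflate the false positive rate beyond the chosen alpha.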
Traffic Splitting with FastAPI
The A/B testing router is the central component of the infrastructure. It must distribute traffic deterministically (the same user must always go to the same variant for the entire duration of the test), record which variant was assigned to each user and each prediction, and be extremely fast to avoid adding latency to the critical path.
# ab_router.py
# A/B testing router for ML models with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
import hashlib
import time
import logging
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
logger = logging.getLogger(__name__)
app = FastAPI(title="ML A/B Testing Router")
# --- Prometheus Metrics ---
AB_REQUESTS = Counter(
"ab_test_requests_total",
"Total requests by variant",
labelnames=["experiment_id", "variant", "model_version"]
)
AB_LATENCY = Histogram(
"ab_test_latency_seconds",
"Inference latency by variant",
labelnames=["experiment_id", "variant"],
buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0]
)
AB_PREDICTIONS = Counter(
"ab_test_predictions_total",
"Prediction distribution by variant",
labelnames=["experiment_id", "variant", "prediction_bucket"]
)
# --- Experiment configuration ---
ACTIVE_EXPERIMENT = {
"experiment_id": "churn_model_v2_vs_v3",
"model_a": {
"name": "churn-model-v2",
"endpoint": "http://model-a-service:8080/predict",
"traffic_weight": 0.5
},
"model_b": {
"name": "churn-model-v3",
"endpoint": "http://model-b-service:8080/predict",
"traffic_weight": 0.5
}
}
class PredictionRequest(BaseModel):
user_id: str
features: dict
def assign_variant(user_id: str, experiment_id: str, traffic_split: float = 0.5) -> str:
"""
Deterministically assigns a user to a variant.
The same user_id + experiment_id always produce the same result.
"""
hash_input = f"{user_id}:{experiment_id}"
hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
normalized = (hash_value % 10000) / 10000.0
return "A" if normalized < traffic_split else "B"
async def call_model(endpoint: str, features: dict) -> dict:
import httpx
async with httpx.AsyncClient(timeout=2.0) as client:
response = await client.post(endpoint, json=features)
response.raise_for_status()
return response.json()
AB_ERRORS = Counter(
    "ab_test_errors_total",
    "Inference errors by variant (queried by the error-rate dashboards and alerts)",
    labelnames=["experiment_id", "variant"]
)
@app.post("/predict")
async def predict(request: PredictionRequest):
    exp = ACTIVE_EXPERIMENT
    exp_id = exp["experiment_id"]
    variant = assign_variant(
        user_id=request.user_id,
        experiment_id=exp_id,
        traffic_split=exp["model_a"]["traffic_weight"]
    )
    model_config = exp["model_a"] if variant == "A" else exp["model_b"]
    AB_REQUESTS.labels(
        experiment_id=exp_id,
        variant=variant,
        model_version=model_config["name"]
    ).inc()
    start_time = time.time()
    try:
        result = await call_model(model_config["endpoint"], request.features)
    except Exception:
        # Record the failure so error-rate dashboards and alerts have data
        AB_ERRORS.labels(experiment_id=exp_id, variant=variant).inc()
        raise
    latency = time.time() - start_time
    AB_LATENCY.labels(experiment_id=exp_id, variant=variant).observe(latency)
    score = result.get("churn_probability", 0)
    bucket = "high" if score > 0.7 else ("medium" if score > 0.3 else "low")
    AB_PREDICTIONS.labels(
        experiment_id=exp_id, variant=variant, prediction_bucket=bucket
    ).inc()
    return {
        "prediction": result,
        "variant": variant,
        "model_version": model_config["name"],
        "experiment_id": exp_id,
        "latency_ms": round(latency * 1000, 2)
    }
@app.get("/metrics")
async def metrics():
return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
Canary Deployment: Progressive Rollout
Canary deployment is a progressive release strategy where the new model (the "canary") initially receives only a small percentage of production traffic, typically 1-5%. If metrics remain stable, the percentage is gradually increased: 5% → 10% → 25% → 50% → 100%. If anomalies appear, immediate rollback sends all traffic back to the stable model.
Unlike a classic 50/50 A/B test, canary deployment is oriented toward risk reduction rather than statistical detection of differences. The goal is not to prove with statistical significance that the new model is better, but to verify it does not cause technical issues or obvious regressions before scaling.
# canary_deployment.py
# Canary deployment with automatic rollback
import hashlib
import logging
from dataclasses import dataclass, field
from prometheus_client import Gauge
logger = logging.getLogger(__name__)
CANARY_TRAFFIC_WEIGHT = Gauge(
"canary_traffic_weight_percent",
"Percentage of traffic routed to canary model",
labelnames=["experiment_id"]
)
@dataclass
class CanaryConfig:
experiment_id: str
stable_model_endpoint: str
canary_model_endpoint: str
initial_canary_weight: float = 0.05 # Start at 5%
max_canary_weight: float = 1.0 # Final target: 100%
step_size: float = 0.10 # Increment per step
step_interval_minutes: int = 30 # Increase every 30 minutes
max_error_rate: float = 0.02 # Rollback if errors > 2%
max_latency_p99_ms: float = 500.0 # Rollback if P99 > 500ms
current_weight: float = field(init=False)
def __post_init__(self):
self.current_weight = self.initial_canary_weight
class CanaryController:
"""
Progressively increases canary traffic.
Automatically rolls back if metrics exceed thresholds.
"""
def __init__(self, config: CanaryConfig):
self.config = config
self.error_count = 0
self.total_count = 0
self.latencies = []
self.is_rolled_back = False
self.is_promoted = False
def should_route_to_canary(self, user_id: str) -> bool:
if self.is_rolled_back:
return False
hash_val = int(hashlib.md5(
f"{user_id}:{self.config.experiment_id}".encode()
).hexdigest(), 16)
normalized = (hash_val % 10000) / 10000.0
return normalized < self.config.current_weight
    def record_outcome(self, is_canary: bool, success: bool, latency_ms: float):
        if not is_canary or self.is_rolled_back:
            return
        self.total_count += 1
        if not success:
            self.error_count += 1
        self.latencies.append(latency_ms)
        # Keep a sliding window so P99 reflects recent behavior, not all history
        if len(self.latencies) > 10_000:
            self.latencies = self.latencies[-10_000:]
        error_rate = self.error_count / max(self.total_count, 1)
        if self.total_count > 100 and error_rate > self.config.max_error_rate:
            logger.critical(f"Error rate {error_rate:.2%} exceeded threshold. Rolling back.")
            self.rollback()
            return
        if len(self.latencies) >= 100:
            p99 = sorted(self.latencies)[int(len(self.latencies) * 0.99)]
            if p99 > self.config.max_latency_p99_ms:
                logger.critical(f"P99 latency {p99:.0f}ms exceeded threshold. Rolling back.")
                self.rollback()
def advance_canary(self):
if self.is_rolled_back or self.is_promoted:
return
new_weight = min(
self.config.current_weight + self.config.step_size,
self.config.max_canary_weight
)
self.config.current_weight = new_weight
CANARY_TRAFFIC_WEIGHT.labels(
experiment_id=self.config.experiment_id
).set(new_weight * 100)
logger.info(f"Canary weight -> {new_weight:.0%}")
if new_weight >= self.config.max_canary_weight:
self.is_promoted = True
logger.info("Canary fully promoted!")
def rollback(self):
self.config.current_weight = 0.0
self.is_rolled_back = True
CANARY_TRAFFIC_WEIGHT.labels(
experiment_id=self.config.experiment_id
).set(0)
logger.warning(f"ROLLBACK for {self.config.experiment_id}")
Shadow Mode: Testing Without User Impact
Shadow mode (or shadow deployment) is the most conservative and at the same time most powerful technique for validating a new model before exposing it to users. Production traffic is duplicated: model A serves real requests and its predictions are returned to users, while model B receives the same requests in parallel but its predictions are discarded or only logged.
This approach allows comparing the two models on real traffic with zero risk to users or the business. It is ideal for validating that the new model has no critical bugs, meets latency requirements under real load, does not produce anomalous or out-of-distribution predictions, and behaves as expected across all user segments.
# shadow_mode.py
# Shadow deployment with async logging
import asyncio
import httpx
import logging
import json
from datetime import datetime
logger = logging.getLogger(__name__)
class ShadowModeRouter:
"""
Routes requests to both the production model and the shadow model.
Only the production model's predictions reach users.
"""
    def __init__(
        self,
        production_endpoint: str,
        shadow_endpoint: str,
        shadow_log_file: str = "shadow_predictions.jsonl"
    ):
        self.production_endpoint = production_endpoint
        self.shadow_endpoint = shadow_endpoint
        self.shadow_log_file = shadow_log_file
        self._background_tasks = set()  # strong refs so pending tasks are not GC'd
    async def predict(self, request_data: dict, request_id: str) -> dict:
        prod_task = asyncio.create_task(
            self._call_model(self.production_endpoint, request_data, "production")
        )
        shadow_task = asyncio.create_task(
            self._call_model(self.shadow_endpoint, request_data, "shadow")
        )
        # Return production response immediately; log shadow in background
        prod_result = await prod_task
        log_task = asyncio.create_task(
            self._log_shadow_result(shadow_task, request_id, request_data, prod_result)
        )
        self._background_tasks.add(log_task)
        log_task.add_done_callback(self._background_tasks.discard)
        return prod_result
async def _call_model(self, endpoint: str, data: dict, label: str) -> dict:
start = asyncio.get_event_loop().time()
try:
async with httpx.AsyncClient(timeout=2.0) as client:
response = await client.post(endpoint, json=data)
response.raise_for_status()
result = response.json()
result["_latency_ms"] = (asyncio.get_event_loop().time() - start) * 1000
return result
except Exception as e:
return {"error": str(e), "_latency_ms": -1}
async def _log_shadow_result(
self, shadow_task, request_id, input_data, prod_result
):
try:
shadow_result = await shadow_task
except Exception as e:
shadow_result = {"error": str(e)}
log_entry = {
"timestamp": datetime.utcnow().isoformat(),
"request_id": request_id,
"production_prediction": prod_result.get("prediction"),
"production_latency_ms": prod_result.get("_latency_ms"),
"shadow_prediction": shadow_result.get("prediction"),
"shadow_latency_ms": shadow_result.get("_latency_ms"),
"shadow_error": shadow_result.get("error"),
"predictions_agree": (
prod_result.get("prediction") == shadow_result.get("prediction")
)
}
with open(self.shadow_log_file, "a") as f:
f.write(json.dumps(log_entry) + "\n")
# --- Analyze shadow results offline ---
def analyze_shadow_results(log_file: str):
import pandas as pd
records = []
with open(log_file) as f:
for line in f:
records.append(json.loads(line))
df = pd.DataFrame(records)
total = len(df)
agreement_rate = df["predictions_agree"].mean()
shadow_errors = df["shadow_error"].notna().sum()
print(f"Total requests analyzed: {total:,}")
print(f"Prediction agreement rate: {agreement_rate:.1%}")
print(f"Shadow model errors: {shadow_errors} ({shadow_errors/total:.1%})")
print(f"Avg production latency: {df['production_latency_ms'].mean():.1f}ms")
print(f"Avg shadow latency: {df['shadow_latency_ms'].mean():.1f}ms")
return df
Multi-Armed Bandits: Beyond Classic A/B Testing
The main limitation of classic A/B testing is the exploration cost: for the entire duration of the test, a fraction of users receives the potentially inferior model. If model B is clearly superior, we are "wasting" conversions of users assigned to A during the weeks of testing.
Multi-Armed Bandits (MAB) solve the exploration-exploitation problem: instead of maintaining a fixed split for the entire test duration, the algorithm dynamically adapts traffic toward the better-performing model, maximizing total conversions during the test itself. A 2025 study from Aimpoint Digital Labs shows that bandit approaches like Thompson Sampling can reduce cumulative regret by 20-35% compared to classic A/B testing in high-effect scenarios.
# thompson_sampling_bandit.py
# Multi-Armed Bandit with Thompson Sampling for ML model selection
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple, Optional
@dataclass
class ModelArm:
"""Represents a model as a bandit arm."""
name: str
endpoint: str
alpha: float = 1.0 # Successes (Beta distribution)
beta: float = 1.0 # Failures (Beta distribution)
@property
def estimated_success_rate(self) -> float:
return self.alpha / (self.alpha + self.beta)
@property
def total_observations(self) -> int:
return int(self.alpha + self.beta - 2)
def sample(self) -> float:
"""Sample from the Beta posterior (Thompson Sampling)."""
return np.random.beta(self.alpha, self.beta)
def update(self, reward: float):
"""
Update distribution with new outcome.
reward = 1.0 for success, 0.0 for failure
"""
if reward >= 0.5:
self.alpha += 1
else:
self.beta += 1
class ThompsonSamplingBandit:
"""
Multi-Armed Bandit with Thompson Sampling.
Optimal for adaptive ML model selection.
"""
def __init__(self, models: List[ModelArm]):
self.models = models
self.selection_history = []
def select_model(self) -> Tuple[int, ModelArm]:
"""Select model by sampling from Beta distributions."""
samples = [arm.sample() for arm in self.models]
best_idx = int(np.argmax(samples))
self.selection_history.append(best_idx)
return best_idx, self.models[best_idx]
def update(self, arm_idx: int, reward: float):
self.models[arm_idx].update(reward)
def get_traffic_allocation(self) -> dict:
if not self.selection_history:
return {arm.name: 1/len(self.models) for arm in self.models}
recent = self.selection_history[-1000:]
total = len(recent)
return {arm.name: recent.count(i) / total for i, arm in enumerate(self.models)}
def check_convergence(self, min_observations: int = 500) -> Optional[str]:
"""
Checks if the bandit has converged toward a clear winner.
Returns winner name or None if still uncertain.
"""
for arm in self.models:
if arm.total_observations < min_observations:
return None
rates = sorted(
[(arm.name, arm.estimated_success_rate) for arm in self.models],
key=lambda x: x[1], reverse=True
)
best_name, best_rate = rates[0]
_, second_rate = rates[1]
# Declare winner if margin > 3%
if best_rate - second_rate > 0.03:
return best_name
return None
# --- Usage example ---
models = [
ModelArm(name="churn-model-v2", endpoint="http://model-v2:8080/predict"),
ModelArm(name="churn-model-v3", endpoint="http://model-v3:8080/predict"),
]
bandit = ThompsonSamplingBandit(models)
# Simulate 1000 interactions
np.random.seed(42)
true_rates = {"churn-model-v2": 0.72, "churn-model-v3": 0.78}
for i in range(1000):
arm_idx, selected_model = bandit.select_model()
reward = float(np.random.random() < true_rates[selected_model.name])
bandit.update(arm_idx, reward)
if (i + 1) % 200 == 0:
alloc = bandit.get_traffic_allocation()
print(f"Step {i+1} - Traffic: {alloc}")
winner = bandit.check_convergence(100)
if winner:
print(f" => WINNER: {winner}")
Statistical Analysis: p-values, Confidence Intervals and Effect Size
At the end of the test period, statistical analysis must answer three distinct questions: is the observed difference statistically significant? How large is the effect? Is the effect practically relevant for the business?
# statistical_analysis.py
# Statistical analysis of A/B test results for ML models
import math
from scipy.stats import norm
def analyze_ab_test_results(
conversions_a: int, total_a: int,
conversions_b: int, total_b: int,
alpha: float = 0.05
) -> dict:
"""
Complete statistical analysis of an A/B test on proportions.
"""
p_a = conversions_a / total_a
p_b = conversions_b / total_b
# Z-test for difference of proportions
p_pool = (conversions_a + conversions_b) / (total_a + total_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1/total_a + 1/total_b))
z_statistic = (p_b - p_a) / se_pool
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))
# Confidence interval for the difference
se_diff = math.sqrt(p_a*(1-p_a)/total_a + p_b*(1-p_b)/total_b)
z_critical = norm.ppf(1 - alpha/2)
diff = p_b - p_a
ci_lower = diff - z_critical * se_diff
ci_upper = diff + z_critical * se_diff
# Effect size (Cohen's h for proportions)
cohens_h = 2*math.asin(math.sqrt(p_b)) - 2*math.asin(math.sqrt(p_a))
effect_magnitude = (
"negligible" if abs(cohens_h) < 0.2
else "small" if abs(cohens_h) < 0.5
else "medium" if abs(cohens_h) < 0.8
else "large"
)
relative_lift = (p_b - p_a) / p_a if p_a > 0 else 0
is_significant = p_value < alpha
return {
"variant_a": {"rate": round(p_a, 4), "rate_pct": f"{p_a:.2%}"},
"variant_b": {"rate": round(p_b, 4), "rate_pct": f"{p_b:.2%}"},
"difference": {
"absolute": round(diff, 4),
"relative_lift_pct": f"{relative_lift:.2%}",
"confidence_interval_95": (round(ci_lower, 4), round(ci_upper, 4))
},
"statistics": {
"p_value": round(p_value, 6),
"is_significant": is_significant,
"cohens_h": round(cohens_h, 4),
"effect_magnitude": effect_magnitude
},
"conclusion": (
f"Model B is statistically better (p={p_value:.4f}, lift={relative_lift:.2%})"
if is_significant and diff > 0
else f"No significant difference detected (p={p_value:.4f})"
)
}
# --- Practical example ---
results = analyze_ab_test_results(
conversions_a=1380, total_a=8500,
conversions_b=1545, total_b=8200
)
print("=== A/B TEST RESULTS ===")
print(f"Model A: {results['variant_a']['rate_pct']} retention rate")
print(f"Model B: {results['variant_b']['rate_pct']} retention rate")
print(f"Relative lift: {results['difference']['relative_lift_pct']}")
print(f"95% CI: {results['difference']['confidence_interval_95']}")
print(f"p-value: {results['statistics']['p_value']}")
print(f"Effect size: {results['statistics']['effect_magnitude']}")
print(f"\nConclusion: {results['conclusion']}")
Bayesian A/B Testing
The frequentist approach with p-values has well-known limitations: a p-value is not the probability that model B is better (it is the probability of observing data at least as extreme as what was observed, assuming H0 is true). The Bayesian approach directly answers the question we actually care about: what is the probability that model B is better than A, and by how much?
The Bayesian approach also supports stopping the test once the probability that one model is best crosses a pre-chosen threshold (e.g., 95%), which mitigates, though does not fully eliminate, the peeking problem of naive frequentist testing.
# bayesian_ab_test.py
# Bayesian A/B testing for ML models
import numpy as np
def bayesian_ab_test(
successes_a: int, trials_a: int,
successes_b: int, trials_b: int,
prior_alpha: float = 1.0,
prior_beta: float = 1.0,
n_samples: int = 100_000
) -> dict:
"""
Bayesian A/B test using Beta distribution as prior/posterior.
Beta prior + Binomial likelihood = Beta posterior (conjugate pair).
"""
# Update priors with observed data
alpha_a = prior_alpha + successes_a
beta_a = prior_beta + (trials_a - successes_a)
alpha_b = prior_alpha + successes_b
beta_b = prior_beta + (trials_b - successes_b)
# Sample from posterior distributions
samples_a = np.random.beta(alpha_a, beta_a, n_samples)
samples_b = np.random.beta(alpha_b, beta_b, n_samples)
# Probability that B is better than A
prob_b_better = float(np.mean(samples_b > samples_a))
# Relative lift distribution
lift_samples = (samples_b - samples_a) / samples_a
lift_mean = float(np.mean(lift_samples))
ci_lower = float(np.percentile(lift_samples, 2.5))
ci_upper = float(np.percentile(lift_samples, 97.5))
prob_lift_2pct = float(np.mean(lift_samples > 0.02))
# Expected loss: cost of choosing the wrong model
expected_loss_a = float(np.mean(np.maximum(samples_b - samples_a, 0)))
expected_loss_b = float(np.mean(np.maximum(samples_a - samples_b, 0)))
return {
"prob_b_better_than_a": round(prob_b_better, 4),
"lift": {
"mean": round(lift_mean, 4),
"credible_interval_95pct": (round(ci_lower, 4), round(ci_upper, 4)),
"prob_lift_above_2pct": round(prob_lift_2pct, 4)
},
"expected_loss": {
"choose_a": round(expected_loss_a, 6),
"choose_b": round(expected_loss_b, 6),
"recommended_choice": "B" if expected_loss_b < expected_loss_a else "A"
},
"decision": (
"Choose B" if prob_b_better > 0.95
else "Choose A" if prob_b_better < 0.05
else f"Uncertain (P(B>A) = {prob_b_better:.1%}) - collect more data"
)
}
# --- Example ---
result = bayesian_ab_test(
successes_a=1380, trials_a=8500,
successes_b=1545, trials_b=8200
)
print("=== BAYESIAN A/B TEST ===")
print(f"P(B > A) = {result['prob_b_better_than_a']:.1%}")
print(f"Mean lift: {result['lift']['mean']:.2%}")
print(f"95% Credible Interval: {result['lift']['credible_interval_95pct']}")
print(f"P(lift > 2%): {result['lift']['prob_lift_above_2pct']:.1%}")
print(f"Recommended choice: {result['expected_loss']['recommended_choice']}")
print(f"Decision: {result['decision']}")
Monitoring Tests with Prometheus and Grafana
An active A/B test in production must be monitored continuously. Do not just wait for the test to end to analyze results: ensure both variants work correctly on the technical side (latency, error rate, availability) and that business metrics align with initial expectations.
# Key PromQL queries for Grafana dashboards during A/B tests:
# 1. Traffic distribution between variants (should be ~50/50)
# sum by (variant) (rate(ab_test_requests_total[5m]))
# 2. P95 latency per variant
# histogram_quantile(0.95, sum by (variant, le) (
# rate(ab_test_latency_seconds_bucket[5m])
# ))
# 3. Error rate per variant
# sum by (variant) (rate(ab_test_errors_total[5m])) /
# sum by (variant) (rate(ab_test_requests_total[5m]))
# 4. Prediction distribution per variant (drift indicator)
# sum by (variant, prediction_bucket) (rate(ab_test_predictions_total[1h]))
---
# prometheus_ab_alerts.yml
groups:
- name: ab_test_alerts
rules:
- alert: ABTestTrafficImbalance
expr: |
abs(
sum(rate(ab_test_requests_total{variant="A"}[10m]))
/ sum(rate(ab_test_requests_total[10m]))
- 0.5
) > 0.10
for: 5m
labels:
severity: warning
annotations:
summary: "A/B test traffic imbalance detected"
description: "Traffic split deviates more than 10% from 50/50"
- alert: ABTestVariantBHighErrors
expr: |
(
sum(rate(ab_test_errors_total{variant="B"}[5m]))
/ sum(rate(ab_test_requests_total{variant="B"}[5m]))
) > 2 * (
sum(rate(ab_test_errors_total{variant="A"}[5m]))
/ sum(rate(ab_test_requests_total{variant="A"}[5m]))
)
for: 10m
labels:
severity: critical
annotations:
summary: "Variant B error rate is more than 2x variant A"
description: "Consider rolling back variant B immediately"
Budget Under 5K EUR/Year: Complete A/B Testing Stack
A complete A/B testing system for ML models does not require an enterprise budget. With an open-source stack and a small VPS, you can have everything you need:
- FastAPI router + Python stats: Open-source, free
- Prometheus + Grafana: Open-source, free
- VPS hosting (Hetzner/OVH): 20-40 EUR/month (240-480 EUR/year)
- Feature flag service (Unleash self-hosted): Open-source, free
- MLflow model registry: Open-source, free
- Estimated total infrastructure cost: 300-600 EUR/year
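As a sketch of how this budget stack could be wired together, a minimal docker-compose might look like the following. Service names, ports and image tags are illustrative assumptions, not a tested deployment:

```yaml
# docker-compose.yml (illustrative sketch)
services:
  ab-router:
    build: .            # the FastAPI router from this guide
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```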
Best Practices and Anti-Patterns
Pre-Experiment Checklist
- Calculate sample size before starting: never run a test "until something interesting appears". The pre-determined sample size is non-negotiable.
- Define ONE primary metric: optimizing for two metrics simultaneously makes the decision ambiguous. Guardrail metrics exist to prevent regressions, not to declare winners.
- Run an A/A test first: before launching the real test, run an A/A test (same model on both variants) to verify the assignment mechanism has no bugs causing artificial differences.
- Document hypotheses before running: write down why you expect model B to be better and by how much. This prevents HARKing bias.
- Analyze new users separately: new users have no history with either model and exhibit different behaviors. Analyze their results independently.
Anti-Patterns to Avoid Absolutely
- Continuous peeking: checking results daily and stopping at first statistical significance increases false positives to 30%. Use Sequential Probability Ratio Tests (SPRT) if you need early stopping.
- HARKing (Hypothesizing After Results are Known): analyzing data to find any significant difference, then telling the story as if it were hypothesized upfront. With 20 segments tested, one will be significant by chance alone with alpha = 0.05.
- Ignoring metric variance: metrics like revenue per user have very heavy tails. A single whale user can make a non-existent effect appear significant. Use bootstrap or non-parametric tests for non-Gaussian metrics.
- Tests that are too short: day-of-week seasonality and novelty effects (users respond positively to novelty for 1-2 days, then revert) require running for at least two full weeks to average out.
- Feedback loops in ML systems: in systems with feedback loops (recommendations, dynamic pricing), predictions from the two variants are not independent. Model this correlation explicitly.
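For the heavy-tailed metrics mentioned above, a bootstrap confidence interval on the difference of means is a simple robust alternative to the normal-approximation z-test. A sketch on simulated revenue data (all numbers here are synthetic):

```python
# bootstrap_ci.py
# Bootstrap CI for the difference in mean revenue per user between variants.
# Robust to the heavy tails that break normal-approximation tests.
import numpy as np

def bootstrap_diff_ci(values_a, values_b, n_boot: int = 10_000,
                      ci: float = 0.95, seed: int = 0):
    rng = np.random.default_rng(seed)
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each variant with replacement; record the mean difference
        diffs[i] = (rng.choice(b, size=b.size).mean()
                    - rng.choice(a, size=a.size).mean())
    low = float(np.percentile(diffs, (1 - ci) / 2 * 100))
    high = float(np.percentile(diffs, (1 + ci) / 2 * 100))
    return low, high

# Simulated heavy-tailed revenue: lognormal body plus occasional "whale" users
rng = np.random.default_rng(42)
revenue_a = rng.lognormal(mean=2.0, sigma=1.0, size=2000)
revenue_b = rng.lognormal(mean=2.05, sigma=1.0, size=2000)
low, high = bootstrap_diff_ci(revenue_a, revenue_b)
print(f"95% bootstrap CI for mean difference: [{low:.2f}, {high:.2f}]")
# An interval that excludes 0 suggests the lift is not a whale artifact
```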
When to Use Which Approach
Strategy Selection Guide
- Shadow Mode: use when the model is completely new, not yet validated, or when the risk of a critical bug is too high. Always the first step before any test with real users.
- Canary Deployment: use to reduce the operational risk of a new deployment. Ideal for critical models (fraud, pricing) where a regression has immediate financial impact.
- Classic A/B Test (50/50): use when you want to measure the business effect with maximum statistical power and operational risk is low. Requires sufficient sample size and a fast feedback loop.
- Multi-Armed Bandit: use when feedback is fast (within hours/days), the exploration cost is high and you prefer maximizing conversions during the test. Not ideal for small effects with slow feedback.
- Bayesian A/B: use when you want flexible stopping rules, to interpret probabilities directly or when you have prior information from previous experiments. Ideal for teams that find p-values confusing.
Conclusion and Next Steps
A/B testing for ML models is much more than a simple traffic split. It requires rigorous statistical design before any implementation, choosing the right strategy based on context (shadow, canary, 50/50, bandit), continuous monitoring during the test, and correct statistical analysis at the end.
The difference between a team that does A/B testing correctly and one that does it poorly is not in the complexity of the code but in the discipline of the process: define hypotheses first, do not look at data during the test, analyze everything correctly afterward. With the open-source stack described in this guide (FastAPI, Prometheus, Grafana, scipy, numpy), you can implement a production-grade system on a minimal budget.
The natural next step is to integrate A/B testing with ML governance: every decision to promote a model to production must be documented, auditable and compliant with ethical and regulatory standards. We will cover this in the next article on ML Governance.
Continue the MLOps Series
- Previous article: Scaling ML on Kubernetes - orchestrating deployments with KubeFlow and Seldon Core
- Next article: ML Governance: Compliance, Audit, Ethics - EU AI Act, explainability and fairness
- Related: Detecting Model Drift: Monitoring and Automated Retraining
- Related: Serving ML Models: FastAPI, Uvicorn and Containerization
- Related series: Advanced Deep Learning - A/B testing for complex neural models