A/B Testing ML Models: Methodology, Metrics, and Implementation
You have trained two versions of your recommendation model. The new transformer-based model shows a 3% higher AUC on the holdout set. A clear improvement, right? But does this difference actually translate into a positive impact for real users? The model might perform better on certain demographic cohorts and worse on others. It might reduce the click-through rate while increasing long-term satisfaction. It might have higher latency that nullifies the accuracy gains. Offline metrics do not lie, but they only tell part of the story.
A/B testing for ML models is the methodology that answers these questions rigorously, comparing model versions on real traffic with real users and measuring the business metrics that actually matter. According to 2025 research from Aimpoint Digital Labs, organizations that adopt structured A/B testing strategies for ML models reduce production regression risk by 40% compared to direct deployments based solely on offline metrics. The MLOps market, valued at $4.38 billion in 2026 and projected to reach $89.18 billion by 2035, has A/B testing as one of its fundamental building blocks.
In this guide we will build a complete A/B testing system for ML models: from statistical theory to a FastAPI router, from canary deployment to shadow mode, from frequentist tests to Bayesian A/B testing with Thompson Sampling, through to test monitoring with Prometheus and Grafana.
What You Will Learn
- Key differences between ML A/B testing and classic web A/B testing
- Experiment design: sample size, statistical power, success metrics
- Traffic splitting with a FastAPI router and progressive canary deployment
- Shadow mode: testing without user impact
- Multi-Armed Bandits and Thompson Sampling as alternatives to classic A/B testing
- Statistical analysis: p-value, confidence intervals, effect size
- Bayesian A/B testing for faster decision making
- Test monitoring with Prometheus and Grafana
- Best practices and anti-patterns to avoid
ML A/B Testing vs Web A/B Testing: Critical Differences
A/B testing was born in web analytics to compare landing page variants, buttons and copy. The basic statistical framework is the same, but A/B testing for ML models has additional complexities that make it substantially different in practice.
In web testing you compare discrete visual experiences: variant A and variant B are clearly separated. In ML models, predictions are continuous, distributed and often correlated over time. A recommendation model serving the same user in different sessions does not produce independent predictions: there is temporal correlation that violates the independence assumptions of classical statistical tests.
Key Differences: ML vs Web A/B Testing
- Metrics: web testing optimizes CTR or conversion rate; ML testing simultaneously optimizes offline metrics (AUC, RMSE) and business metrics (revenue, churn rate, NPS), which often conflict.
- Feedback latency: web results are immediate (click); ML results may take days or weeks (churn after 30 days, revenue after a quarter).
- Effect distribution: a model may perform better on average but worse on specific cohorts (age bias, geographic bias), requiring segmented analysis.
- System effects: in feedback loop systems (recommendations, dynamic pricing), model B influences the data that will train model C.
- Operational risks: a bug in a web variant causes a poor UX; a bug in a fraud detection ML model can cause significant financial losses.
Experiment Design: Before the Code
A poorly designed A/B test is worse than no A/B test: it provides a false sense of scientific rigor while producing wrong conclusions. Experiment design must precede any technical implementation.
Defining Success Metrics
Every experiment must have a single Primary Metric that determines the winner, plus zero to two Guardrail Metrics that model B must not degrade compared to A. The primary metric must be directly causally linked to the business objective.
Examples of metrics for different scenarios:
- Churn model: primary = 30-day retention rate; guardrails = P95 latency, campaign cost
- Recommendation model: primary = revenue per session; guardrails = CTR, recommendation diversity
- Fraud model: primary = undetected fraud rate; guardrails = false positive rate, latency
- Pricing model: primary = gross margin; guardrails = conversion rate, NPS
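These definitions are easy to let drift once a test is running; one option is to pin them down in code at design time. A minimal sketch (all names here, such as `ExperimentMetrics`, are illustrative, not part of any framework):

```python
# experiment_config.py
# Sketch: capture the primary metric and guardrails as an explicit,
# reviewable object. All names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class GuardrailMetric:
    name: str
    max_degradation: float  # e.g. 0.10 = tolerate at most a 10% relative regression

@dataclass(frozen=True)
class ExperimentMetrics:
    primary_metric: str
    guardrails: tuple = ()

    def validate(self):
        # Enforce the design rules above: one primary metric, at most two guardrails
        if not self.primary_metric:
            raise ValueError("A primary metric is required")
        if len(self.guardrails) > 2:
            raise ValueError("Use at most two guardrail metrics")

churn_experiment = ExperimentMetrics(
    primary_metric="retention_rate_30d",
    guardrails=(
        GuardrailMetric("latency_p95_ms", max_degradation=0.10),
        GuardrailMetric("campaign_cost", max_degradation=0.05),
    ),
)
churn_experiment.validate()
```

Freezing the dataclasses makes the experiment definition immutable for the duration of the test, which is exactly the discipline the design phase is meant to enforce.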
Sample Size Calculation
The required sample size depends on three factors: the minimum effect size you want to detect (minimum detectable effect, MDE), the significance level alpha (usually 0.05) and the statistical power 1-beta (usually 0.80).
# sample_size_calculator.py
# Sample size calculation for ML A/B testing
import math
from scipy.stats import norm
def calculate_sample_size(
baseline_rate: float,
minimum_detectable_effect: float,
alpha: float = 0.05,
power: float = 0.80,
two_tailed: bool = True
) -> int:
"""
Calculates sample size for an A/B test on proportions.
Args:
baseline_rate: Current rate for model A (e.g., 0.15 for 15% churn)
minimum_detectable_effect: Minimum relative change to detect (e.g., 0.05 for +5%)
alpha: Significance level (type I error rate)
power: Statistical power (1 - type II error rate)
two_tailed: True for two-tailed test (recommended default)
Returns:
Sample size for each of the two variants
"""
p1 = baseline_rate
p2 = baseline_rate * (1 + minimum_detectable_effect)
z_alpha = norm.ppf(1 - alpha / (2 if two_tailed else 1))
z_beta = norm.ppf(power)
p_avg = (p1 + p2) / 2
q_avg = 1 - p_avg
numerator = (
z_alpha * math.sqrt(2 * p_avg * q_avg)
+ z_beta * math.sqrt(p1 * (1-p1) + p2 * (1-p2))
) ** 2
denominator = (p2 - p1) ** 2
return math.ceil(numerator / denominator)
def calculate_duration_days(
sample_size_per_variant: int,
daily_requests: int,
traffic_split: float = 0.5
) -> float:
"""Estimates test duration in days."""
return sample_size_per_variant / (daily_requests * traffic_split)
# --- Practical example: churn model ---
baseline_churn_rate = 0.18 # 18% current churn (model A)
mde = 0.10 # detect a 10% relative improvement (18% to 16.2%)
n_per_variant = calculate_sample_size(
baseline_rate=baseline_churn_rate,
    minimum_detectable_effect=-mde,  # negative: an improvement means churn goes down
alpha=0.05,
power=0.80
)
daily_traffic = 5000
test_duration = calculate_duration_days(n_per_variant, daily_traffic, 0.5)
print(f"Sample size per variant: {n_per_variant:,} samples")
print(f"Estimated test duration: {test_duration:.1f} days")
print(f"Total traffic needed: {n_per_variant * 2:,} requests")
# Output (approximate):
# Sample size per variant: 6,868 samples
# Estimated test duration: 2.7 days
# Total traffic needed: 13,736 requests
The Peeking Problem: Do Not Look at Results Too Early
"Peeking" (or optional stopping) is one of the most common mistakes in A/B testing: checking interim results and stopping the test as soon as statistical significance appears. This dramatically inflates the false positive rate: if you check the data every day over a multi-week test, the probability of finding at least one significant result by chance alone can climb toward 30% even when the two variants are identical. Fix the sample size in advance and evaluate results only at the end of the test, or adopt a sequential testing method such as the Sequential Probability Ratio Test (SPRT), which is designed for repeated looks.
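For teams that genuinely need interim looks, the SPRT mentioned above can be sketched for a binary outcome as follows. This is a minimal illustration under Wald's classic threshold approximation, not a production sequential-testing library; the rates p0 and p1 must be fixed before the test starts.

```python
# sprt_sketch.py
# Minimal Sequential Probability Ratio Test for a Bernoulli outcome.
# H0: success rate = p0, H1: success rate = p1.
import math

def sprt_decision(successes: int, trials: int,
                  p0: float, p1: float,
                  alpha: float = 0.05, beta: float = 0.20) -> str:
    failures = trials - successes
    # Log-likelihood ratio of the observed data under H1 vs H0
    llr = (successes * math.log(p1 / p0)
           + failures * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this
    if llr >= upper:
        return "accept_h1"  # stop: evidence for the improved rate
    if llr <= lower:
        return "accept_h0"  # stop: evidence against the improvement
    return "continue"       # keep collecting data

# Example: testing whether retention improved from 82% to 84%
print(sprt_decision(successes=850, trials=1000, p0=0.82, p1=0.84))
# -> accept_h1
```

Unlike naive peeking, each interim check here is built into the test's error guarantees, so stopping early does not inflate the false positive rate beyond the chosen alpha.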
Traffic Splitting with FastAPI
The A/B testing router is the central component of the infrastructure. It must distribute traffic deterministically (the same user must always go to the same variant for the entire duration of the test), record which variant was assigned to each user and each prediction, and be extremely fast to avoid adding latency to the critical path.
# ab_router.py
# A/B testing router for ML models with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
import hashlib
import time
import logging
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
logger = logging.getLogger(__name__)
app = FastAPI(title="ML A/B Testing Router")
# --- Prometheus Metrics ---
AB_REQUESTS = Counter(
"ab_test_requests_total",
"Total requests by variant",
labelnames=["experiment_id", "variant", "model_version"]
)
AB_LATENCY = Histogram(
"ab_test_latency_seconds",
"Inference latency by variant",
labelnames=["experiment_id", "variant"],
buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0]
)
AB_PREDICTIONS = Counter(
"ab_test_predictions_total",
"Prediction distribution by variant",
labelnames=["experiment_id", "variant", "prediction_bucket"]
)
# --- Experiment configuration ---
ACTIVE_EXPERIMENT = {
"experiment_id": "churn_model_v2_vs_v3",
"model_a": {
"name": "churn-model-v2",
"endpoint": "http://model-a-service:8080/predict",
"traffic_weight": 0.5
},
"model_b": {
"name": "churn-model-v3",
"endpoint": "http://model-b-service:8080/predict",
"traffic_weight": 0.5
}
}
class PredictionRequest(BaseModel):
user_id: str
features: dict
def assign_variant(user_id: str, experiment_id: str, traffic_split: float = 0.5) -> str:
"""
Deterministically assigns a user to a variant.
The same user_id + experiment_id always produce the same result.
"""
hash_input = f"{user_id}:{experiment_id}"
hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
normalized = (hash_value % 10000) / 10000.0
return "A" if normalized < traffic_split else "B"
async def call_model(endpoint: str, features: dict) -> dict:
import httpx
async with httpx.AsyncClient(timeout=2.0) as client:
response = await client.post(endpoint, json=features)
response.raise_for_status()
return response.json()
AB_ERRORS = Counter(
    "ab_test_errors_total",
    "Inference errors by variant (queried by the error-rate dashboards and alerts)",
    labelnames=["experiment_id", "variant"]
)
@app.post("/predict")
async def predict(request: PredictionRequest):
    exp = ACTIVE_EXPERIMENT
    exp_id = exp["experiment_id"]
    variant = assign_variant(
        user_id=request.user_id,
        experiment_id=exp_id,
        traffic_split=exp["model_a"]["traffic_weight"]
    )
    model_config = exp["model_a"] if variant == "A" else exp["model_b"]
    AB_REQUESTS.labels(
        experiment_id=exp_id,
        variant=variant,
        model_version=model_config["name"]
    ).inc()
    start_time = time.time()
    try:
        result = await call_model(model_config["endpoint"], request.features)
    except Exception:
        # Record the failure so error-rate dashboards and alerts have data
        AB_ERRORS.labels(experiment_id=exp_id, variant=variant).inc()
        raise
    latency = time.time() - start_time
    AB_LATENCY.labels(experiment_id=exp_id, variant=variant).observe(latency)
    score = result.get("churn_probability", 0)
    bucket = "high" if score > 0.7 else ("medium" if score > 0.3 else "low")
    AB_PREDICTIONS.labels(
        experiment_id=exp_id, variant=variant, prediction_bucket=bucket
    ).inc()
    return {
        "prediction": result,
        "variant": variant,
        "model_version": model_config["name"],
        "experiment_id": exp_id,
        "latency_ms": round(latency * 1000, 2)
    }
@app.get("/metrics")
async def metrics():
return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
Canary Deployment: Progressive Rollout
Canary deployment is a progressive release strategy where the new model (the "canary") initially receives only a small percentage of production traffic, typically 1-5%. If metrics remain stable, the percentage is gradually increased: 5% → 10% → 25% → 50% → 100%. If anomalies appear, immediate rollback sends all traffic back to the stable model.
Unlike a classic 50/50 A/B test, canary deployment is oriented toward risk reduction rather than statistical detection of differences. The goal is not to prove with statistical significance that the new model is better, but to verify it does not cause technical issues or obvious regressions before scaling.
# canary_deployment.py
# Canary deployment with automatic rollback
import hashlib
import logging
from dataclasses import dataclass, field
from prometheus_client import Gauge
logger = logging.getLogger(__name__)
CANARY_TRAFFIC_WEIGHT = Gauge(
"canary_traffic_weight_percent",
"Percentage of traffic routed to canary model",
labelnames=["experiment_id"]
)
@dataclass
class CanaryConfig:
experiment_id: str
stable_model_endpoint: str
canary_model_endpoint: str
initial_canary_weight: float = 0.05 # Start at 5%
max_canary_weight: float = 1.0 # Final target: 100%
step_size: float = 0.10 # Increment per step
step_interval_minutes: int = 30 # Increase every 30 minutes
max_error_rate: float = 0.02 # Rollback if errors > 2%
max_latency_p99_ms: float = 500.0 # Rollback if P99 > 500ms
current_weight: float = field(init=False)
def __post_init__(self):
self.current_weight = self.initial_canary_weight
class CanaryController:
"""
Progressively increases canary traffic.
Automatically rolls back if metrics exceed thresholds.
"""
def __init__(self, config: CanaryConfig):
self.config = config
self.error_count = 0
self.total_count = 0
self.latencies = []
self.is_rolled_back = False
self.is_promoted = False
def should_route_to_canary(self, user_id: str) -> bool:
if self.is_rolled_back:
return False
hash_val = int(hashlib.md5(
f"{user_id}:{self.config.experiment_id}".encode()
).hexdigest(), 16)
normalized = (hash_val % 10000) / 10000.0
return normalized < self.config.current_weight
    def record_outcome(self, is_canary: bool, success: bool, latency_ms: float):
        if not is_canary or self.is_rolled_back:
            return
        self.total_count += 1
        if not success:
            self.error_count += 1
        self.latencies.append(latency_ms)
        # Keep a sliding window so P99 reflects recent behavior, not all history
        if len(self.latencies) > 10_000:
            self.latencies = self.latencies[-10_000:]
        error_rate = self.error_count / max(self.total_count, 1)
        if self.total_count > 100 and error_rate > self.config.max_error_rate:
            logger.critical(f"Error rate {error_rate:.2%} exceeded threshold. Rolling back.")
            self.rollback()
            return
        if len(self.latencies) >= 100:
            p99 = sorted(self.latencies)[int(len(self.latencies) * 0.99)]
            if p99 > self.config.max_latency_p99_ms:
                logger.critical(f"P99 latency {p99:.0f}ms exceeded threshold. Rolling back.")
                self.rollback()
def advance_canary(self):
if self.is_rolled_back or self.is_promoted:
return
new_weight = min(
self.config.current_weight + self.config.step_size,
self.config.max_canary_weight
)
self.config.current_weight = new_weight
CANARY_TRAFFIC_WEIGHT.labels(
experiment_id=self.config.experiment_id
).set(new_weight * 100)
logger.info(f"Canary weight -> {new_weight:.0%}")
if new_weight >= self.config.max_canary_weight:
self.is_promoted = True
logger.info("Canary fully promoted!")
def rollback(self):
self.config.current_weight = 0.0
self.is_rolled_back = True
CANARY_TRAFFIC_WEIGHT.labels(
experiment_id=self.config.experiment_id
).set(0)
logger.warning(f"ROLLBACK for {self.config.experiment_id}")
Shadow Mode: Testing Without User Impact
Shadow mode (or shadow deployment) is the most conservative and at the same time most powerful technique for validating a new model before exposing it to users. Production traffic is duplicated: model A serves real requests and its predictions are returned to users, while model B receives the same requests in parallel but its predictions are discarded or only logged.
This approach allows comparing the two models on real traffic with zero risk to users or the business. It is ideal for validating that the new model has no critical bugs, meets latency requirements under real load, does not produce anomalous or out-of-distribution predictions, and behaves as expected across all user segments.
# shadow_mode.py
# Shadow deployment with async logging
import asyncio
import httpx
import logging
import json
from datetime import datetime
logger = logging.getLogger(__name__)
class ShadowModeRouter:
"""
Routes requests to both the production model and the shadow model.
Only the production model's predictions reach users.
"""
    def __init__(
        self,
        production_endpoint: str,
        shadow_endpoint: str,
        shadow_log_file: str = "shadow_predictions.jsonl"
    ):
        self.production_endpoint = production_endpoint
        self.shadow_endpoint = shadow_endpoint
        self.shadow_log_file = shadow_log_file
        self._background_tasks = set()  # strong refs so pending tasks are not GC'd
    async def predict(self, request_data: dict, request_id: str) -> dict:
        prod_task = asyncio.create_task(
            self._call_model(self.production_endpoint, request_data, "production")
        )
        shadow_task = asyncio.create_task(
            self._call_model(self.shadow_endpoint, request_data, "shadow")
        )
        # Return production response immediately; log shadow in background
        prod_result = await prod_task
        log_task = asyncio.create_task(
            self._log_shadow_result(shadow_task, request_id, request_data, prod_result)
        )
        self._background_tasks.add(log_task)
        log_task.add_done_callback(self._background_tasks.discard)
        return prod_result
async def _call_model(self, endpoint: str, data: dict, label: str) -> dict:
start = asyncio.get_event_loop().time()
try:
async with httpx.AsyncClient(timeout=2.0) as client:
response = await client.post(endpoint, json=data)
response.raise_for_status()
result = response.json()
result["_latency_ms"] = (asyncio.get_event_loop().time() - start) * 1000
return result
except Exception as e:
return {"error": str(e), "_latency_ms": -1}
async def _log_shadow_result(
self, shadow_task, request_id, input_data, prod_result
):
try:
shadow_result = await shadow_task
except Exception as e:
shadow_result = {"error": str(e)}
log_entry = {
"timestamp": datetime.utcnow().isoformat(),
"request_id": request_id,
"production_prediction": prod_result.get("prediction"),
"production_latency_ms": prod_result.get("_latency_ms"),
"shadow_prediction": shadow_result.get("prediction"),
"shadow_latency_ms": shadow_result.get("_latency_ms"),
"shadow_error": shadow_result.get("error"),
"predictions_agree": (
prod_result.get("prediction") == shadow_result.get("prediction")
)
}
with open(self.shadow_log_file, "a") as f:
f.write(json.dumps(log_entry) + "\n")
# --- Analyze shadow results offline ---
def analyze_shadow_results(log_file: str):
import pandas as pd
records = []
with open(log_file) as f:
for line in f:
records.append(json.loads(line))
df = pd.DataFrame(records)
total = len(df)
agreement_rate = df["predictions_agree"].mean()
shadow_errors = df["shadow_error"].notna().sum()
print(f"Total requests analyzed: {total:,}")
print(f"Prediction agreement rate: {agreement_rate:.1%}")
print(f"Shadow model errors: {shadow_errors} ({shadow_errors/total:.1%})")
print(f"Avg production latency: {df['production_latency_ms'].mean():.1f}ms")
print(f"Avg shadow latency: {df['shadow_latency_ms'].mean():.1f}ms")
return df
Multi-Armed Bandits: Beyond Classic A/B Testing
The main limitation of classic A/B testing is the exploration cost: for the entire duration of the test, a fraction of users receives the potentially inferior model. If model B is clearly superior, we are "wasting" conversions of users assigned to A during the weeks of testing.
Multi-Armed Bandits (MAB) solve the exploration-exploitation problem: instead of maintaining a fixed split for the entire test duration, the algorithm dynamically adapts traffic toward the better-performing model, maximizing total conversions during the test itself. A 2025 study from Aimpoint Digital Labs shows that bandit approaches like Thompson Sampling can reduce cumulative regret by 20-35% compared to classic A/B testing in high-effect scenarios.
# thompson_sampling_bandit.py
# Multi-Armed Bandit with Thompson Sampling for ML model selection
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple, Optional
@dataclass
class ModelArm:
"""Represents a model as a bandit arm."""
name: str
endpoint: str
alpha: float = 1.0 # Successes (Beta distribution)
beta: float = 1.0 # Failures (Beta distribution)
@property
def estimated_success_rate(self) -> float:
return self.alpha / (self.alpha + self.beta)
@property
def total_observations(self) -> int:
return int(self.alpha + self.beta - 2)
def sample(self) -> float:
"""Sample from the Beta posterior (Thompson Sampling)."""
return np.random.beta(self.alpha, self.beta)
def update(self, reward: float):
"""
Update distribution with new outcome.
reward = 1.0 for success, 0.0 for failure
"""
if reward >= 0.5:
self.alpha += 1
else:
self.beta += 1
class ThompsonSamplingBandit:
"""
Multi-Armed Bandit with Thompson Sampling.
Optimal for adaptive ML model selection.
"""
def __init__(self, models: List[ModelArm]):
self.models = models
self.selection_history = []
def select_model(self) -> Tuple[int, ModelArm]:
"""Select model by sampling from Beta distributions."""
samples = [arm.sample() for arm in self.models]
best_idx = int(np.argmax(samples))
self.selection_history.append(best_idx)
return best_idx, self.models[best_idx]
def update(self, arm_idx: int, reward: float):
self.models[arm_idx].update(reward)
def get_traffic_allocation(self) -> dict:
if not self.selection_history:
return {arm.name: 1/len(self.models) for arm in self.models}
recent = self.selection_history[-1000:]
total = len(recent)
return {arm.name: recent.count(i) / total for i, arm in enumerate(self.models)}
def check_convergence(self, min_observations: int = 500) -> Optional[str]:
"""
Checks if the bandit has converged toward a clear winner.
Returns winner name or None if still uncertain.
"""
for arm in self.models:
if arm.total_observations < min_observations:
return None
rates = sorted(
[(arm.name, arm.estimated_success_rate) for arm in self.models],
key=lambda x: x[1], reverse=True
)
best_name, best_rate = rates[0]
_, second_rate = rates[1]
# Declare winner if margin > 3%
if best_rate - second_rate > 0.03:
return best_name
return None
# --- Usage example ---
models = [
ModelArm(name="churn-model-v2", endpoint="http://model-v2:8080/predict"),
ModelArm(name="churn-model-v3", endpoint="http://model-v3:8080/predict"),
]
bandit = ThompsonSamplingBandit(models)
# Simulate 1000 interactions
np.random.seed(42)
true_rates = {"churn-model-v2": 0.72, "churn-model-v3": 0.78}
for i in range(1000):
arm_idx, selected_model = bandit.select_model()
reward = float(np.random.random() < true_rates[selected_model.name])
bandit.update(arm_idx, reward)
if (i + 1) % 200 == 0:
alloc = bandit.get_traffic_allocation()
print(f"Step {i+1} - Traffic: {alloc}")
winner = bandit.check_convergence(100)
if winner:
print(f" => WINNER: {winner}")
Statistical Analysis: p-values, Confidence Intervals and Effect Size
At the end of the test period, statistical analysis must answer three distinct questions: is the observed difference statistically significant? How large is the effect? Is the effect practically relevant for the business?
# statistical_analysis.py
# Statistical analysis of A/B test results for ML models
import math
from scipy.stats import norm
def analyze_ab_test_results(
conversions_a: int, total_a: int,
conversions_b: int, total_b: int,
alpha: float = 0.05
) -> dict:
"""
Complete statistical analysis of an A/B test on proportions.
"""
p_a = conversions_a / total_a
p_b = conversions_b / total_b
# Z-test for difference of proportions
p_pool = (conversions_a + conversions_b) / (total_a + total_b)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1/total_a + 1/total_b))
z_statistic = (p_b - p_a) / se_pool
p_value = 2 * (1 - norm.cdf(abs(z_statistic)))
# Confidence interval for the difference
se_diff = math.sqrt(p_a*(1-p_a)/total_a + p_b*(1-p_b)/total_b)
z_critical = norm.ppf(1 - alpha/2)
diff = p_b - p_a
ci_lower = diff - z_critical * se_diff
ci_upper = diff + z_critical * se_diff
# Effect size (Cohen's h for proportions)
cohens_h = 2*math.asin(math.sqrt(p_b)) - 2*math.asin(math.sqrt(p_a))
effect_magnitude = (
"negligible" if abs(cohens_h) < 0.2
else "small" if abs(cohens_h) < 0.5
else "medium" if abs(cohens_h) < 0.8
else "large"
)
relative_lift = (p_b - p_a) / p_a if p_a > 0 else 0
is_significant = p_value < alpha
return {
"variant_a": {"rate": round(p_a, 4), "rate_pct": f"{p_a:.2%}"},
"variant_b": {"rate": round(p_b, 4), "rate_pct": f"{p_b:.2%}"},
"difference": {
"absolute": round(diff, 4),
"relative_lift_pct": f"{relative_lift:.2%}",
"confidence_interval_95": (round(ci_lower, 4), round(ci_upper, 4))
},
"statistics": {
"p_value": round(p_value, 6),
"is_significant": is_significant,
"cohens_h": round(cohens_h, 4),
"effect_magnitude": effect_magnitude
},
"conclusion": (
f"Model B is statistically better (p={p_value:.4f}, lift={relative_lift:.2%})"
if is_significant and diff > 0
else f"No significant difference detected (p={p_value:.4f})"
)
}
# --- Practical example ---
results = analyze_ab_test_results(
conversions_a=1380, total_a=8500,
conversions_b=1545, total_b=8200
)
print("=== A/B TEST RESULTS ===")
print(f"Model A: {results['variant_a']['rate_pct']} retention rate")
print(f"Model B: {results['variant_b']['rate_pct']} retention rate")
print(f"Relative lift: {results['difference']['relative_lift_pct']}")
print(f"95% CI: {results['difference']['confidence_interval_95']}")
print(f"p-value: {results['statistics']['p_value']}")
print(f"Effect size: {results['statistics']['effect_magnitude']}")
print(f"\nConclusion: {results['conclusion']}")
Bayesian A/B Testing
The frequentist approach with p-values has well-known limitations: a p-value is not the probability that model B is better (it is the probability of observing data at least as extreme as what was observed, assuming H0 is true). The Bayesian approach directly answers the question we actually care about: what is the probability that model B is better than A, and by how much?
The Bayesian approach also supports stopping the test once the probability that one model is best crosses a pre-chosen threshold (e.g., 95%), which mitigates, though does not fully eliminate, the peeking problem of naive frequentist testing.
# bayesian_ab_test.py
# Bayesian A/B testing for ML models
import numpy as np
def bayesian_ab_test(
successes_a: int, trials_a: int,
successes_b: int, trials_b: int,
prior_alpha: float = 1.0,
prior_beta: float = 1.0,
n_samples: int = 100_000
) -> dict:
"""
Bayesian A/B test using Beta distribution as prior/posterior.
Beta prior + Binomial likelihood = Beta posterior (conjugate pair).
"""
# Update priors with observed data
alpha_a = prior_alpha + successes_a
beta_a = prior_beta + (trials_a - successes_a)
alpha_b = prior_alpha + successes_b
beta_b = prior_beta + (trials_b - successes_b)
# Sample from posterior distributions
samples_a = np.random.beta(alpha_a, beta_a, n_samples)
samples_b = np.random.beta(alpha_b, beta_b, n_samples)
# Probability that B is better than A
prob_b_better = float(np.mean(samples_b > samples_a))
# Relative lift distribution
lift_samples = (samples_b - samples_a) / samples_a
lift_mean = float(np.mean(lift_samples))
ci_lower = float(np.percentile(lift_samples, 2.5))
ci_upper = float(np.percentile(lift_samples, 97.5))
prob_lift_2pct = float(np.mean(lift_samples > 0.02))
# Expected loss: cost of choosing the wrong model
expected_loss_a = float(np.mean(np.maximum(samples_b - samples_a, 0)))
expected_loss_b = float(np.mean(np.maximum(samples_a - samples_b, 0)))
return {
"prob_b_better_than_a": round(prob_b_better, 4),
"lift": {
"mean": round(lift_mean, 4),
"credible_interval_95pct": (round(ci_lower, 4), round(ci_upper, 4)),
"prob_lift_above_2pct": round(prob_lift_2pct, 4)
},
"expected_loss": {
"choose_a": round(expected_loss_a, 6),
"choose_b": round(expected_loss_b, 6),
"recommended_choice": "B" if expected_loss_b < expected_loss_a else "A"
},
"decision": (
"Choose B" if prob_b_better > 0.95
else "Choose A" if prob_b_better < 0.05
else f"Uncertain (P(B>A) = {prob_b_better:.1%}) - collect more data"
)
}
# --- Example ---
result = bayesian_ab_test(
successes_a=1380, trials_a=8500,
successes_b=1545, trials_b=8200
)
print("=== BAYESIAN A/B TEST ===")
print(f"P(B > A) = {result['prob_b_better_than_a']:.1%}")
print(f"Mean lift: {result['lift']['mean']:.2%}")
print(f"95% Credible Interval: {result['lift']['credible_interval_95pct']}")
print(f"P(lift > 2%): {result['lift']['prob_lift_above_2pct']:.1%}")
print(f"Recommended choice: {result['expected_loss']['recommended_choice']}")
print(f"Decision: {result['decision']}")
Monitoring Tests with Prometheus and Grafana
An active A/B test in production must be monitored continuously. Do not just wait for the test to end to analyze results: ensure both variants work correctly on the technical side (latency, error rate, availability) and that business metrics align with initial expectations.
# Key PromQL queries for Grafana dashboards during A/B tests:
# 1. Traffic distribution between variants (should be ~50/50)
# sum by (variant) (rate(ab_test_requests_total[5m]))
# 2. P95 latency per variant
# histogram_quantile(0.95, sum by (variant, le) (
# rate(ab_test_latency_seconds_bucket[5m])
# ))
# 3. Error rate per variant
# sum by (variant) (rate(ab_test_errors_total[5m])) /
# sum by (variant) (rate(ab_test_requests_total[5m]))
# 4. Prediction distribution per variant (drift indicator)
# sum by (variant, prediction_bucket) (rate(ab_test_predictions_total[1h]))
---
# prometheus_ab_alerts.yml
groups:
- name: ab_test_alerts
rules:
- alert: ABTestTrafficImbalance
expr: |
abs(
sum(rate(ab_test_requests_total{variant="A"}[10m]))
/ sum(rate(ab_test_requests_total[10m]))
- 0.5
) > 0.10
for: 5m
labels:
severity: warning
annotations:
summary: "A/B test traffic imbalance detected"
description: "Traffic split deviates more than 10% from 50/50"
- alert: ABTestVariantBHighErrors
expr: |
(
sum(rate(ab_test_errors_total{variant="B"}[5m]))
/ sum(rate(ab_test_requests_total{variant="B"}[5m]))
) > 2 * (
sum(rate(ab_test_errors_total{variant="A"}[5m]))
/ sum(rate(ab_test_requests_total{variant="A"}[5m]))
)
for: 10m
labels:
severity: critical
annotations:
summary: "Variant B error rate is more than 2x variant A"
description: "Consider rolling back variant B immediately"
Budget Under 5K EUR/Year: Complete A/B Testing Stack
A complete A/B testing system for ML models does not require an enterprise budget. With an open-source stack and a small VPS, you can have everything you need:
- FastAPI router + Python stats: Open-source, free
- Prometheus + Grafana: Open-source, free
- VPS hosting (Hetzner/OVH): 20-40 EUR/month (240-480 EUR/year)
- Feature flag service (Unleash self-hosted): Open-source, free
- MLflow model registry: Open-source, free
- Estimated total infrastructure cost: 300-600 EUR/year
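As a sketch of how this budget stack could be wired together, a minimal docker-compose might look like the following. Service names, ports and image tags are illustrative assumptions, not a tested deployment:

```yaml
# docker-compose.yml (illustrative sketch)
services:
  ab-router:
    build: .            # the FastAPI router from this guide
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```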
Best Practices and Anti-Patterns
Pre-Experiment Checklist
- Calculate sample size before starting: never run a test "until something interesting appears". The pre-determined sample size is non-negotiable.
- Define ONE primary metric: optimizing for two metrics simultaneously makes the decision ambiguous. Guardrail metrics exist to prevent regressions, not to declare winners.
- Run an A/A test first: before launching the real test, run an A/A test (same model on both variants) to verify the assignment mechanism has no bugs causing artificial differences.
- Document hypotheses before running: write down why you expect model B to be better and by how much. This prevents HARKing bias.
- Analyze new users separately: new users have no history with either model and exhibit different behaviors. Analyze their results independently.
Anti-Patterns to Avoid Absolutely
- Continuous peeking: checking results daily and stopping at first statistical significance increases false positives to 30%. Use Sequential Probability Ratio Tests (SPRT) if you need early stopping.
- HARKing (Hypothesizing After Results are Known): analyzing data to find any significant difference, then telling the story as if it were hypothesized upfront. With 20 segments tested, one will be significant by chance alone with alpha = 0.05.
- Ignoring metric variance: metrics like revenue per user have very heavy tails. A single whale user can make a non-existent effect appear significant. Use bootstrap or non-parametric tests for non-Gaussian metrics.
- Tests that are too short: day-of-week seasonality and novelty effects (users respond positively to novelty for 1-2 days, then revert) require running for at least two full weeks to average out.
- Feedback loops in ML systems: in systems with feedback loops (recommendations, dynamic pricing), predictions from the two variants are not independent. Model this correlation explicitly.
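For the heavy-tailed metrics mentioned above, a bootstrap confidence interval on the difference of means is a simple robust alternative to the normal-approximation z-test. A sketch on simulated revenue data (all numbers here are synthetic):

```python
# bootstrap_ci.py
# Bootstrap CI for the difference in mean revenue per user between variants.
# Robust to the heavy tails that break normal-approximation tests.
import numpy as np

def bootstrap_diff_ci(values_a, values_b, n_boot: int = 10_000,
                      ci: float = 0.95, seed: int = 0):
    rng = np.random.default_rng(seed)
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each variant with replacement; record the mean difference
        diffs[i] = (rng.choice(b, size=b.size).mean()
                    - rng.choice(a, size=a.size).mean())
    low = float(np.percentile(diffs, (1 - ci) / 2 * 100))
    high = float(np.percentile(diffs, (1 + ci) / 2 * 100))
    return low, high

# Simulated heavy-tailed revenue: lognormal body plus occasional "whale" users
rng = np.random.default_rng(42)
revenue_a = rng.lognormal(mean=2.0, sigma=1.0, size=2000)
revenue_b = rng.lognormal(mean=2.05, sigma=1.0, size=2000)
low, high = bootstrap_diff_ci(revenue_a, revenue_b)
print(f"95% bootstrap CI for mean difference: [{low:.2f}, {high:.2f}]")
# An interval that excludes 0 suggests the lift is not a whale artifact
```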
When to Use Which Approach
Strategy Selection Guide
- Shadow Mode: use when the model is completely new, not yet validated, or when the risk of a critical bug is too high. Always the first step before any test with real users.
- Canary Deployment: use to reduce the operational risk of a new deployment. Ideal for critical models (fraud, pricing) where a regression has immediate financial impact.
- Classic A/B Test (50/50): use when you want to measure the business effect with maximum statistical power and operational risk is low. Requires sufficient sample size and a fast feedback loop.
- Multi-Armed Bandit: use when feedback is fast (within hours/days), the exploration cost is high and you prefer maximizing conversions during the test. Not ideal for small effects with slow feedback.
- Bayesian A/B: use when you want flexible stopping rules, to interpret probabilities directly or when you have prior information from previous experiments. Ideal for teams that find p-values confusing.
Conclusion and Next Steps
A/B testing for ML models is much more than a simple traffic split. It requires rigorous statistical design before any implementation, choosing the right strategy based on context (shadow, canary, 50/50, bandit), continuous monitoring during the test, and correct statistical analysis at the end.
The difference between a team that does A/B testing correctly and one that does it poorly is not in the complexity of the code but in the discipline of the process: define hypotheses first, do not look at data during the test, analyze everything correctly afterward. With the open-source stack described in this guide (FastAPI, Prometheus, Grafana, scipy, numpy), you can implement a production-grade system on a minimal budget.
The natural next step is to integrate A/B testing with ML governance: every decision to promote a model to production must be documented, auditable and compliant with ethical and regulatory standards. We will cover this in the next article on ML Governance.
Continue the MLOps Series
- Previous article: Scaling ML on Kubernetes - orchestrating deployments with KubeFlow and Seldon Core
- Next article: ML Governance: Compliance, Audit, Ethics - EU AI Act, explainability and fairness
- Related: Detecting Model Drift: Monitoring and Automated Retraining
- Related: Serving ML Models: FastAPI, Uvicorn and Containerization
- Related series: Advanced Deep Learning - A/B testing for complex neural models