MLOps 101: From Experiment to Production
Every data scientist has experienced this moment: the model performs flawlessly in the Jupyter notebook, metrics look stellar, the team celebrates during the demo. Then comes the dreaded question: "When can we ship this to production?" Silence follows. Industry estimates suggest that up to 85% of machine learning projects never reach production. Not because the models are broken, but because the infrastructure, processes, and discipline required to run them reliably and continuously simply do not exist.
MLOps (Machine Learning Operations) exists precisely to bridge this gap. It is not a single tool or technology, but a set of practices, tools, and cultural shifts that transform isolated experiments into robust, production-grade ML systems. In this article, we will explore what MLOps means, why it has become indispensable, and how to start applying it concretely, even on a limited budget.
What You Will Learn
- Why most ML projects never reach production and how MLOps solves this problem
- The key differences between DevOps and MLOps
- Google's 3-level MLOps maturity model
- The complete lifecycle of an ML model in production
- How to track experiments with MLflow
- How to serve a model with FastAPI and Docker
- An open-source stack to get started for less than $5,000/year
What Is MLOps and Why Does It Matter
MLOps applies DevOps principles to the machine learning lifecycle. Just as DevOps unified development and operations for traditional software, MLOps brings together data science, engineering, and operations for ML systems. The goal is to automate and make reproducible every stage: from data preparation to training, from validation to deployment, from monitoring to retraining.
DevOps vs MLOps: Key Differences
Engineers coming from a software background might assume that standard DevOps practices translate directly to ML. In reality, fundamental differences make MLOps a discipline of its own.
| Aspect | DevOps | MLOps |
|---|---|---|
| Artifact | Source code | Code + Data + Model |
| Versioning | Git for code | Git + DVC for data and models |
| Testing | Unit tests, integration tests | Data validation, model validation, A/B tests |
| CI/CD | Build, test, deploy code | Train, validate, deploy model |
| Monitoring | Latency, errors, uptime | Data drift, concept drift, model performance |
| Degradation | Explicit bugs | Silent degradation over time |
| Reproducibility | Same code = same output | Same code + same data + same seed = same output |
The most critical difference is silent degradation. A traditional software service either works or it does not: a bug produces an error. An ML model can keep returning predictions without any technical errors while its accuracy steadily deteriorates because incoming data has shifted from the training distribution. Without targeted monitoring, no one notices until users start complaining.
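The reproducibility row in the table above can be made concrete: fixing the random seed, alongside the code and the data, is what makes a training run repeatable. A minimal sketch with scikit-learn (the synthetic dataset and parameters are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fixed data: same seed for the generator means the same dataset every run
X, y = make_classification(n_samples=200, random_state=0)

def train(seed: int) -> np.ndarray:
    """Train a small forest with a fixed seed and return its predictions."""
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    model.fit(X, y)
    return model.predict(X)

preds_a = train(seed=42)
preds_b = train(seed=42)
assert (preds_a == preds_b).all()  # same code + same data + same seed = same output
```

Drop any one of the three ingredients (unversioned data, an unpinned seed, untracked code changes) and the equation no longer holds.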
The ML "Valley of Death"
Gartner predicted that 30% of generative AI projects would be abandoned after the proof-of-concept stage by the end of 2025, due to poor data quality, inadequate risk controls, escalating costs, or unclear business value. MLOps systematically addresses each of these root causes.
The MLOps Market: Numbers and Trends
The MLOps market is expanding at a remarkable pace. According to industry analyses, the global MLOps market was valued between $2 and $3 billion in 2025, with projections ranging from $25 to $56 billion by 2035, at a compound annual growth rate (CAGR) between 29% and 42% depending on the source.
These numbers reflect a concrete reality: organizations are investing heavily in bringing ML models to production. According to market estimates, over 70% of large enterprises in North America run production AI workloads, and more than 55% have integrated automated model monitoring. Yet nearly two-thirds of organizations remain stuck in the pilot stage, unable to scale AI across the enterprise.
The 3 MLOps Maturity Levels
Google defined a 3-level MLOps maturity model that has become the de facto industry standard. Each level represents an increasing degree of automation and reliability in the ML lifecycle.
Level 0: Manual Process
At Level 0, every step is manual. The data scientist works in a notebook, trains the model locally, exports it as a file, and hands it to the engineering team who wraps it in an API. There is no automation, no monitoring, no automatic retraining.
| Characteristic | Level 0 |
|---|---|
| Training | Manual, in a notebook |
| Deployment | Manual, file handoff (.pkl or .h5) |
| Monitoring | None or manual |
| Retraining | Only on explicit request |
| Reproducibility | Poor or nonexistent |
This level is common in organizations starting to apply ML to their use cases. It may be sufficient when models are rarely updated and data changes slowly, but it does not scale.
Level 1: ML Pipeline Automation
At Level 1, training is automated through an ML pipeline. Instead of deploying a single model, you deploy the entire pipeline that produces it. This enables continuous training: when new data arrives, the pipeline automatically retrains the model.
| Characteristic | Level 1 |
|---|---|
| Training | Automated via pipeline |
| Deployment | Automated pipeline |
| Monitoring | Model performance + retraining triggers |
| Retraining | Automatic on new data or degradation |
| Reproducibility | Good (versioned pipelines) |
Level 1 is sufficient when data changes frequently but the ML approach remains stable. The pipeline stays the same but is re-executed periodically with fresh data.
Level 2: CI/CD for Machine Learning
At Level 2, a full CI/CD system purpose-built for ML is added. Not only does the data change, but so does the pipeline code, features, hyperparameters, and model architecture. Every change goes through automated tests, validation, and controlled deployment.
| Characteristic | Level 2 |
|---|---|
| Training | Automated + CI/CD on the pipeline itself |
| Deployment | Blue/green, canary, A/B testing |
| Monitoring | Full: data drift, concept drift, performance, latency |
| Retraining | Automatic with validation and rollback |
| Reproducibility | Complete (code + data + environment versioned) |
Reaching Level 2 is the target for mature organizations. It requires significant investment in infrastructure and culture, but it is the only sustainable way to manage dozens or hundreds of models in production.
The MLOps Lifecycle
The lifecycle of an ML model in production is an iterative process spanning six core phases. Unlike traditional software development, this cycle never truly ends: a production model requires continuous maintenance.
+----------+ +---------+ +----------+
| DATA |---->| TRAIN |---->| EVALUATE |
| Collect | | Feature | | Validate |
| Clean | | Train | | Compare |
| Version | | Tune | | Approve |
+----------+ +---------+ +----------+
^ |
| v
+----------+ +---------+ +----------+
| RETRAIN |<----| MONITOR |<----| DEPLOY |
| Trigger | | Drift | | Stage |
| Schedule | | Metrics | | Canary |
| Auto | | Alert | | Release |
+----------+ +---------+ +----------+
1. Data: Collection, Cleaning, and Versioning
Everything starts with data. In this phase, raw data is collected, cleaned (handling missing values, outliers, duplicates), transformed into useful features, and versioned. Data versioning is essential: to reproduce a model, you need to know exactly which data was used for training. Tools like DVC (Data Version Control) version large datasets in a Git-like manner.
2. Train: Feature Engineering and Model Training
With data ready, features are built, an algorithm is chosen, and the model is trained. Each experiment (combination of hyperparameters, features, architecture) is tracked with its parameters and metrics. Tools like MLflow make this process systematic and reproducible.
3. Evaluate: Validation and Comparison
The trained model is validated against predefined metrics (accuracy, F1-score, RMSE, AUC) and compared with the version currently in production. If the new model does not meet minimum thresholds or does not improve upon the previous one, it is not promoted.
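The promotion rule described above can be captured in a small gate function. This is a sketch of the idea, not a standard API; the metric names and thresholds are illustrative:

```python
def should_promote(new_metrics: dict, prod_metrics: dict,
                   min_accuracy: float = 0.85) -> bool:
    """Promote only if the candidate clears the floor AND beats production."""
    if new_metrics["accuracy"] < min_accuracy:
        return False  # fails the absolute minimum threshold
    return new_metrics["f1_score"] > prod_metrics["f1_score"]

# Candidate clears the floor and beats the production model on F1: promoted
assert should_promote({"accuracy": 0.91, "f1_score": 0.88},
                      {"accuracy": 0.89, "f1_score": 0.85})
# Candidate below the accuracy floor is rejected regardless of its F1
assert not should_promote({"accuracy": 0.80, "f1_score": 0.90},
                          {"accuracy": 0.89, "f1_score": 0.85})
```

Making the gate an explicit function means the promotion decision is versioned, testable, and auditable, rather than a judgment call buried in a notebook.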
4. Deploy: Staging, Canary, and Release
The approved model progresses through environments: staging for integration tests, canary for validation with limited real traffic, and finally full production. Strategies like blue/green deployment and canary releases minimize risk.
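At its core, a canary release is a weighted routing decision: a small configurable fraction of requests goes to the candidate model while the rest hits the stable one. A minimal sketch (the function and variant names are hypothetical):

```python
import random

def route(canary_fraction: float, rng: random.Random) -> str:
    """Return which model variant should serve this request."""
    return "canary" if rng.random() < canary_fraction else "stable"

rng = random.Random(42)  # seeded so the example is deterministic
routed = [route(0.1, rng) for _ in range(10_000)]
canary_share = routed.count("canary") / len(routed)
assert 0.05 < canary_share < 0.15  # roughly 10% of traffic hits the canary
```

In a real deployment, this split typically lives in the load balancer or service mesh rather than application code, but the principle is the same: if the canary's metrics hold up, the fraction is ramped to 100%; if not, rollback is instant.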
5. Monitor: Drift, Metrics, and Alerts
In production, the model is monitored continuously. Both technical metrics (latency, throughput, errors) and ML metrics (accuracy on real data, prediction distribution, data drift) are tracked. Alerts fire when metrics fall below thresholds.
6. Retrain: Triggers and Automation
When monitoring detects degradation, retraining is triggered. This can be scheduled (e.g., weekly), trigger-based (e.g., accuracy below 90%), or manual. The new model goes through the evaluate and deploy phases again.
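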
The Open-Source MLOps Stack
One of MLOps' greatest strengths is a mature open-source ecosystem covering every lifecycle phase. You do not need expensive enterprise platforms to get started: the right combination of open-source tools builds a complete MLOps pipeline.
| Phase | Tool | Purpose |
|---|---|---|
| Data Versioning | DVC | Version datasets and models, integrated with Git |
| Experiment Tracking | MLflow | Log parameters, metrics, and artifacts per experiment |
| Model Registry | MLflow Model Registry | Version and promote models (staging/production) |
| Pipeline Orchestration | Prefect / Airflow | Workflow orchestration, scheduling, retries |
| Model Serving | FastAPI + Docker | REST API for predictions, containerized |
| Containerization | Docker + K8s | Reproducible environments, horizontal scaling |
| Monitoring | Prometheus + Grafana | Metrics, dashboards, alerting |
| Data Validation | Great Expectations | Automated data quality tests |
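Great Expectations automates exactly this kind of check in a declarative way; to give a flavor of what "automated data quality tests" means, here is a hand-rolled sketch (the column names follow the churn example used later in this article):

```python
import pandas as pd

def validate_customers(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations (empty list = data is OK)."""
    errors = []
    for col in ("age", "orders", "avg_order_value"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
    if "age" in df.columns and not df["age"].between(0, 120).all():
        errors.append("age out of range [0, 120]")
    if df.isna().any().any():
        errors.append("dataset contains missing values")
    return errors

good = pd.DataFrame({"age": [30, 45], "orders": [3, 1], "avg_order_value": [20.0, 35.5]})
bad = pd.DataFrame({"age": [30, 150], "orders": [3, 1], "avg_order_value": [20.0, None]})
assert validate_customers(good) == []
assert len(validate_customers(bad)) == 2  # age out of range + missing value
```

Running such checks in the pipeline, before training and before serving, catches bad data where it is cheapest to fix.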
From Notebook to Pipeline: A Practical Example
Let us walk through the most common transition every ML team faces: refactoring code written in a Jupyter notebook into a modular, reproducible pipeline. We will take a real-world classification example and restructure it step by step.
Before: The Monolithic Notebook
Here is the typical notebook where everything lives in a single file: no separation of concerns, no logging, no versioning.
# Cell 1: Everything in one notebook
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
import pickle
# Load data
df = pd.read_csv("data/customers.csv")
# Inline feature engineering
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 100],
labels=["young", "adult", "senior", "elderly"])
df["total_spend"] = df["orders"] * df["avg_order_value"]
# Split
X = df[["age", "total_spend", "visits", "days_since_last"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Training - hardcoded parameters
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
# Evaluation - print to screen
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"F1: {f1_score(y_test, y_pred)}")
# Save - pickle with no versioning
with open("model.pkl", "wb") as f:
pickle.dump(model, f)
print("Model saved!")
Problems with the Monolithic Notebook
- Not reproducible: no seed, no data versioning
- Not tracked: parameters and metrics live only in notebook output
- Not testable: no isolated functions to test
- Not deployable: a pickle file is not an API
- Not maintainable: changing one feature requires re-running everything
After: A Modular Pipeline
We restructure the code into separate modules, each with a single responsibility. Every function is testable, every parameter is configurable, and every metric is tracked.
"""Module for data preparation and transformation."""
import pandas as pd
from pathlib import Path
from typing import Tuple
def load_data(path: str) -> pd.DataFrame:
"""Load the dataset from the specified path."""
filepath = Path(path)
if not filepath.exists():
raise FileNotFoundError(f"Dataset not found: {path}")
return pd.read_csv(filepath)
def create_features(df: pd.DataFrame) -> pd.DataFrame:
"""Create derived features for the model."""
result = df.copy()
result["age_group"] = pd.cut(
result["age"],
bins=[0, 25, 45, 65, 100],
labels=["young", "adult", "senior", "elderly"]
)
result["total_spend"] = result["orders"] * result["avg_order_value"]
return result
def split_data(
df: pd.DataFrame,
target_col: str = "churned",
feature_cols: list = None,
test_size: float = 0.2,
random_state: int = 42
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
"""Split data into train/test with a fixed seed for reproducibility."""
from sklearn.model_selection import train_test_split
if feature_cols is None:
feature_cols = ["age", "total_spend", "visits", "days_since_last"]
X = df[feature_cols]
y = df[target_col]
return train_test_split(X, y, test_size=test_size, random_state=random_state)
"""Module for model training and evaluation."""
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from typing import Dict, Any
import pandas as pd
def train_model(
X_train: pd.DataFrame,
y_train: pd.Series,
n_estimators: int = 100,
max_depth: int = 10,
random_state: int = 42
) -> RandomForestClassifier:
"""Train a RandomForestClassifier with configurable parameters."""
model = RandomForestClassifier(
n_estimators=n_estimators,
max_depth=max_depth,
random_state=random_state
)
model.fit(X_train, y_train)
return model
def evaluate_model(
model: RandomForestClassifier,
X_test: pd.DataFrame,
y_test: pd.Series
) -> Dict[str, float]:
"""Evaluate the model and return a metrics dictionary."""
y_pred = model.predict(X_test)
return {
"accuracy": accuracy_score(y_test, y_pred),
"f1_score": f1_score(y_test, y_pred),
"precision": precision_score(y_test, y_pred),
"recall": recall_score(y_test, y_pred),
}
"""Main pipeline orchestrating all stages."""
from src.data.preprocessing import load_data, create_features, split_data
from src.models.trainer import train_model, evaluate_model
import yaml
from pathlib import Path
def run_pipeline(config_path: str = "config.yaml") -> None:
"""Run the full ML pipeline using an external config."""
# 1. Load configuration
with open(config_path) as f:
config = yaml.safe_load(f)
# 2. Data preparation
print("[1/4] Loading data...")
df = load_data(config["data"]["path"])
df = create_features(df)
# 3. Split
print("[2/4] Splitting train/test...")
    X_train, X_test, y_train, y_test = split_data(
        df,
        feature_cols=config["data"]["feature_cols"],
        test_size=config["data"]["test_size"],
        random_state=config["data"]["random_state"]
    )
# 4. Training
print("[3/4] Training model...")
model = train_model(
X_train, y_train,
n_estimators=config["model"]["n_estimators"],
max_depth=config["model"]["max_depth"],
random_state=config["model"]["random_state"]
)
# 5. Evaluation
print("[4/4] Evaluating...")
metrics = evaluate_model(model, X_test, y_test)
for name, value in metrics.items():
print(f" {name}: {value:.4f}")
if __name__ == "__main__":
run_pipeline()
# config.yaml - All parameters in one file
data:
path: "data/customers.csv"
test_size: 0.2
random_state: 42
feature_cols:
- age
- total_spend
- visits
- days_since_last
model:
algorithm: "random_forest"
n_estimators: 100
max_depth: 10
random_state: 42
evaluation:
metrics:
- accuracy
- f1_score
- precision
- recall
min_accuracy: 0.85
Benefits of a Modular Pipeline
- Reproducible: fixed seed, externalized configuration, versionable data
- Testable: every function is isolated and can have dedicated unit tests
- Maintainable: changing features does not affect training and vice versa
- Configurable: change hyperparameters without touching code
- Automatable: the pipeline can be triggered by CI/CD
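The "testable" claim above is easy to verify: because `create_features` is now an isolated function, it can be unit-tested in seconds. A sketch of such a test (a trimmed-down copy of the function is inlined here so the example is self-contained; in the real project you would import it from src/data/preprocessing.py):

```python
import pandas as pd

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Trimmed-down copy of the feature step from src/data/preprocessing.py."""
    result = df.copy()
    result["total_spend"] = result["orders"] * result["avg_order_value"]
    return result

def test_create_features_adds_total_spend():
    df = pd.DataFrame({"orders": [2, 3], "avg_order_value": [10.0, 20.0]})
    out = create_features(df)
    assert list(out["total_spend"]) == [20.0, 60.0]
    assert "total_spend" not in df.columns  # the input frame is not mutated

test_create_features_adds_total_spend()
```

With pytest, tests like this run on every commit via CI, which is precisely what makes the later automation steps safe.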
Experiment Tracking with MLflow
How many times have you tweaked a hyperparameter and then forgotten which combination yielded the best result? Experiment tracking solves this by automatically recording the parameters, metrics, and artifacts of every experiment.
MLflow is the most widely adopted open-source tool for experiment tracking. It provides a tracking server with a web UI for visualizing and comparing experiments, a Python API for logging, and a Model Registry for managing model lifecycles.
Setup and First Experiment
# Installation
pip install mlflow
# Start a local tracking server
mlflow server --host 127.0.0.1 --port 5000
"""ML pipeline with experiment tracking via MLflow."""
import mlflow
import mlflow.sklearn
from src.data.preprocessing import load_data, create_features, split_data
from src.models.trainer import train_model, evaluate_model
def run_tracked_pipeline(config: dict) -> None:
"""Run the pipeline while tracking everything with MLflow."""
# Set the tracking URI (local or remote server)
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("churn-prediction")
with mlflow.start_run(run_name="rf-baseline") as run:
# Log parameters
mlflow.log_param("algorithm", "RandomForest")
mlflow.log_param("n_estimators", config["model"]["n_estimators"])
mlflow.log_param("max_depth", config["model"]["max_depth"])
mlflow.log_param("test_size", config["data"]["test_size"])
mlflow.log_param("random_state", config["data"]["random_state"])
# Data preparation
df = load_data(config["data"]["path"])
df = create_features(df)
X_train, X_test, y_train, y_test = split_data(
df,
test_size=config["data"]["test_size"],
random_state=config["data"]["random_state"]
)
# Log dataset dimensions
mlflow.log_param("train_samples", len(X_train))
mlflow.log_param("test_samples", len(X_test))
mlflow.log_param("n_features", X_train.shape[1])
# Training
model = train_model(
X_train, y_train,
n_estimators=config["model"]["n_estimators"],
max_depth=config["model"]["max_depth"]
)
# Evaluation
metrics = evaluate_model(model, X_test, y_test)
# Log metrics
for name, value in metrics.items():
mlflow.log_metric(name, value)
# Log the model as an artifact
mlflow.sklearn.log_model(
model,
artifact_path="model",
registered_model_name="churn-classifier"
)
# Log the config as an artifact
mlflow.log_artifact("config.yaml")
print(f"Run ID: {run.info.run_id}")
print(f"Metrics: {metrics}")
After running several experiments, open http://127.0.0.1:5000 in your browser. The MLflow UI displays a table of all experiments, letting you compare metrics, sort by performance, and visualize parameter-vs-metric charts.
Model Registry: Versioning Models
Just as code is versioned with Git, ML models should be versioned with a Model Registry. MLflow Model Registry provides a centralized system for managing model lifecycles through three stages.
| Stage | Description | Used By |
|---|---|---|
| None / Staging | Model under testing and validation | Data scientists, QA |
| Production | Approved model serving real traffic | Serving API, end users |
| Archived | Retired model kept for auditing | Compliance, rollback |
"""Model lifecycle management with MLflow Model Registry."""
from mlflow.tracking import MlflowClient
client = MlflowClient("http://127.0.0.1:5000")
# Retrieve the latest staging version
latest_versions = client.get_latest_versions(
name="churn-classifier",
stages=["Staging"]
)
if latest_versions:
version = latest_versions[0].version
print(f"Model in staging: v{version}")
# Promote to Production after validation
client.transition_model_version_stage(
name="churn-classifier",
version=version,
stage="Production",
archive_existing_versions=True # Archive previous version
)
print(f"Model v{version} promoted to Production")
# Load the production model for inference
import mlflow.pyfunc
model = mlflow.pyfunc.load_model("models:/churn-classifier/Production")
prediction = model.predict(new_data)  # new_data: a DataFrame with the model's feature columns
Deployment: FastAPI + Docker
A production ML model is typically exposed as a REST API. FastAPI is the ideal choice for Python: it is fast (ASGI-based), generates automatic documentation (OpenAPI/Swagger), and has excellent data validation through Pydantic. By containerizing with Docker, we get an artifact deployable anywhere.
"""REST API for serving ML model predictions."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import mlflow.pyfunc
import pandas as pd
import logging
from typing import List
# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(
title="Churn Prediction API",
description="ML-based churn prediction API",
version="1.0.0"
)
class PredictionRequest(BaseModel):
"""Prediction request schema."""
age: int = Field(..., ge=0, le=120, description="Customer age")
total_spend: float = Field(..., ge=0, description="Total spending")
visits: int = Field(..., ge=0, description="Number of visits")
days_since_last: int = Field(..., ge=0, description="Days since last visit")
class PredictionResponse(BaseModel):
"""Prediction response schema."""
prediction: int
probability: float
model_version: str
# Load model at startup
MODEL_NAME = "churn-classifier"
MODEL_STAGE = "Production"
model = None
model_version = "unknown"
@app.on_event("startup")
async def load_model():
"""Load the MLflow model on server startup."""
global model, model_version
try:
model_uri = f"models:/{MODEL_NAME}/{MODEL_STAGE}"
model = mlflow.pyfunc.load_model(model_uri)
model_version = model.metadata.run_id[:8]
logger.info(f"Model loaded: {MODEL_NAME} ({model_version})")
except Exception as e:
logger.error(f"Model loading error: {e}")
raise
@app.get("/health")
async def health_check():
"""Health check endpoint."""
return {"status": "healthy", "model_loaded": model is not None}
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
"""Generate a churn prediction for a customer."""
if model is None:
raise HTTPException(status_code=503, detail="Model not loaded")
try:
input_data = pd.DataFrame([request.model_dump()])
prediction = model.predict(input_data)
        # NOTE: pyfunc predict returns class labels, not probabilities; for real
        # probabilities, load the sklearn flavor and call predict_proba instead
        probability = float(prediction[0]) if hasattr(prediction[0], '__float__') else 0.0
return PredictionResponse(
prediction=int(prediction[0]),
probability=probability,
model_version=model_version
)
except Exception as e:
logger.error(f"Prediction error: {e}")
raise HTTPException(status_code=500, detail="Prediction error")
@app.post("/predict/batch", response_model=List[PredictionResponse])
async def predict_batch(requests: List[PredictionRequest]):
"""Generate batch predictions for multiple customers."""
if model is None:
raise HTTPException(status_code=503, detail="Model not loaded")
input_data = pd.DataFrame([r.model_dump() for r in requests])
predictions = model.predict(input_data)
return [
PredictionResponse(
prediction=int(p),
probability=float(p),
model_version=model_version
)
for p in predictions
]
# Dockerfile for ML model serving
FROM python:3.11-slim
WORKDIR /app
# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
&& rm -rf /var/lib/apt/lists/*
# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Application code
COPY src/serving/ ./serving/
COPY config.yaml .
# Service port
EXPOSE 8000
# Healthcheck (python:3.11-slim does not ship curl, so use Python itself)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
# Start with uvicorn
CMD ["uvicorn", "serving.app:app", "--host", "0.0.0.0", "--port", "8000"]
# Build the image
docker build -t churn-api:v1.0.0 .
# Start the container
docker run -d \
--name churn-api \
-p 8000:8000 \
-e MLFLOW_TRACKING_URI=http://mlflow-server:5000 \
churn-api:v1.0.0
# Test the API
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"age": 35, "total_spend": 1250.50, "visits": 12, "days_since_last": 45}'
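The API and the MLflow server can also be run together with Docker Compose, which is handy for local development. A minimal sketch (the service names and image tags are illustrative, not from the article's repository):

```yaml
# docker-compose.yaml - local serving stack (illustrative sketch)
services:
  mlflow-server:
    image: ghcr.io/mlflow/mlflow:latest
    command: mlflow server --host 0.0.0.0 --port 5000
    ports:
      - "5000:5000"
  churn-api:
    image: churn-api:v1.0.0
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow-server:5000
    ports:
      - "8000:8000"
    depends_on:
      - mlflow-server
```

With this file in place, `docker compose up` starts both containers on a shared network, and the API reaches the tracking server by its service name.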
Production Monitoring
Deployment is not the end of the journey but the beginning of a critical new phase: monitoring. A production model degrades over time because the world changes and data shifts with it. Monitoring must cover three main areas.
Metrics to Track
| Category | Metrics | Tool |
|---|---|---|
| Infrastructure | Latency (p50, p95, p99), throughput, HTTP errors, CPU/RAM | Prometheus + Grafana |
| Model | Accuracy, F1-score, prediction distribution, confidence | MLflow + custom metrics |
| Data | Data drift, feature drift, missing values, input distribution | Evidently AI / Great Expectations |
Data Drift vs Concept Drift
It is critical to distinguish between two types of model degradation:
- Data Drift: the distribution of incoming data shifts compared to the training set. Example: a model trained on customers aged 25-45 starts receiving requests for customers aged 60+.
- Concept Drift: the relationship between inputs and outputs changes. Example: after a pandemic, customer churn patterns are completely different, but the input features have the same distribution.
"""Data drift detection using statistical tests."""
import numpy as np
from scipy import stats
from typing import Dict, Tuple
def detect_drift(
reference_data: np.ndarray,
production_data: np.ndarray,
feature_names: list,
threshold: float = 0.05
) -> Dict[str, Dict]:
"""
Detect data drift by comparing distributions with the KS test.
Args:
reference_data: training data (reference)
production_data: production data (current)
feature_names: feature names
threshold: p-value threshold for drift (default 0.05)
Returns:
Drift report for each feature
"""
drift_report = {}
for i, feature in enumerate(feature_names):
ref_values = reference_data[:, i]
prod_values = production_data[:, i]
# Kolmogorov-Smirnov test
ks_stat, p_value = stats.ks_2samp(ref_values, prod_values)
drift_detected = p_value < threshold
drift_report[feature] = {
"ks_statistic": round(ks_stat, 4),
"p_value": round(p_value, 4),
"drift_detected": drift_detected,
"ref_mean": round(float(np.mean(ref_values)), 4),
"prod_mean": round(float(np.mean(prod_values)), 4),
}
if drift_detected:
print(f"DRIFT DETECTED on '{feature}': "
f"KS={ks_stat:.4f}, p={p_value:.4f}")
return drift_report
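To see the KS test in action, here is a tiny synthetic demonstration: a production distribution shifted away from the reference is flagged, while an unshifted one typically yields a p-value well above 0.05 (the sample sizes and the size of the shift are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(loc=40, scale=5, size=2_000)  # e.g. training-time ages
same_dist = rng.normal(loc=40, scale=5, size=2_000)  # production, no drift
shifted = rng.normal(loc=55, scale=5, size=2_000)    # production, clear drift

_, p_same = stats.ks_2samp(reference, same_dist)
_, p_shift = stats.ks_2samp(reference, shifted)
assert p_shift < 0.05      # the shifted distribution is flagged as drift
assert p_shift < p_same    # and is far less plausible than the unshifted one
```

In practice, run this comparison on a rolling window of production inputs against a frozen reference sample from the training set.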
When to Trigger Retraining
Not every drift warrants immediate retraining. Define clear thresholds: data drift on critical features, accuracy drop greater than 5%, or a significantly skewed prediction distribution. Avoid excessive retraining, which can introduce instability.
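Those rules can be encoded as an explicit policy so that retraining decisions are auditable rather than ad hoc. A sketch (the function shape, feature names, and thresholds mirror the examples above but are assumptions, not a standard API):

```python
def should_retrain(current_accuracy: float,
                   baseline_accuracy: float,
                   drifted_features: list[str],
                   critical_features: frozenset = frozenset({"total_spend", "visits"})) -> bool:
    """Retrain on a >5% relative accuracy drop or drift on a critical feature."""
    accuracy_drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    if accuracy_drop > 0.05:
        return True
    return any(f in critical_features for f in drifted_features)

assert should_retrain(0.82, 0.90, [])               # ~8.9% relative drop -> retrain
assert should_retrain(0.90, 0.90, ["total_spend"])  # drift on a critical feature -> retrain
assert not should_retrain(0.89, 0.90, ["age"])      # small drop, non-critical drift -> hold
```

Whatever the exact thresholds, keeping them in versioned code (or config) prevents the instability that comes from retraining on every alert.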
Getting Started on Less Than $5,000/Year
MLOps does not have to mean six-figure enterprise platforms. For small to mid-sized teams, it is entirely possible to build effective MLOps infrastructure using open-source tools and minimal cloud spend.
Proposed Stack for Small Teams
| Component | Solution | Annual Cost |
|---|---|---|
| Code | GitHub Free / GitLab CE | $0 |
| Data Versioning | DVC + Google Cloud Storage (5 GB free) | $0 - $50 |
| Experiment Tracking | MLflow on a budget VM | $200 - $500 |
| Training | Google Colab Pro / spot VMs | $120 - $600 |
| Serving | FastAPI on a VM (2 vCPU, 4 GB RAM) | $300 - $800 |
| Monitoring | Prometheus + Grafana (self-hosted) | $0 (same VM) |
| CI/CD | GitHub Actions (2,000 min/month free) | $0 |
| Container Registry | GitHub Container Registry | $0 |
Estimated total: $620 - $1,950/year, well below the $5,000 threshold. This stack supports up to 5-10 models in production with moderate traffic volumes (thousands of predictions per day).
Cost Reduction Tips
- Spot/preemptible VMs: up to 70% savings for non-urgent training
- Autoscaling: scale to zero when there are no requests
- Model compression: smaller models = fewer serving resources
- Batch inference: if real-time predictions are not needed, use nightly batches
- Multi-tenant: a single MLflow/Grafana infrastructure for all projects
MLOps Project Structure
To wrap up with something immediately actionable, here is the recommended folder structure for an MLOps project. This organization follows separation of concerns and facilitates automation, testing, and collaboration.
churn-prediction/
data/
raw/ # Raw data (versioned with DVC)
processed/ # Transformed data
data.dvc # DVC tracking file
src/
data/
preprocessing.py # Cleaning and feature engineering
validation.py # Data quality validation
models/
trainer.py # Training logic
evaluator.py # Evaluation and metrics
serving/
app.py # FastAPI application
schemas.py # Pydantic schemas
monitoring/
drift_detector.py # Drift detection
metrics.py # Custom metrics
pipeline.py # Pipeline orchestration
tests/
test_preprocessing.py
test_trainer.py
test_api.py
config.yaml # Pipeline configuration
Dockerfile # Serving container
docker-compose.yaml # Full local stack
requirements.txt # Python dependencies
.dvc/ # DVC configuration
.github/
workflows/
train.yaml # CI/CD for training
deploy.yaml # CI/CD for deployment
mlflow/ # MLflow artifacts (local)
README.md
Conclusion and Next Steps
MLOps is not a luxury reserved for Big Tech. It is a necessity for anyone who wants to bring ML models to production reliably and sustainably. In this article we covered the fundamentals: from understanding the problem (why ML projects fail) to concrete solutions (modular pipelines, experiment tracking, model registry, containerized serving, and monitoring).
The key is to start incrementally. You do not need to reach Level 2 of Google's maturity model on day one. Start at Level 0 with good practices:
- Right now: Break your notebook code into modules. Use a config.yaml.
- Week 1: Add MLflow for experiment tracking.
- Week 2: Containerize your model with FastAPI + Docker.
- Month 1: Implement a CI/CD pipeline with GitHub Actions.
- Month 2: Add monitoring with Prometheus and basic alerting.
- Month 3: Implement DVC for data versioning.
In the upcoming articles in this series, we will dive deeper into each component: data management with DVC, ML-specific CI/CD pipelines, advanced monitoring with Evidently AI, and scalable deployment on Kubernetes. Each article will be hands-on, with working code and step-by-step instructions.
Series Roadmap
- Article 2: DVC - Data Versioning for ML
- Article 3: MLflow Deep Dive - Advanced Experiment Tracking
- Article 4: CI/CD for Machine Learning with GitHub Actions
- Article 5: Feature Stores and Production Feature Engineering
- Article 6: Scalable Model Serving with Kubernetes
- Article 7: Advanced Monitoring: Data Drift and Evidently AI
- Article 8: Governance, Compliance, and Responsible ML