Dataset and Model Versioning with DVC: A Production MLOps Guide
Picture this: you train a model that hits 94% accuracy, deploy it to production, and then three months later nobody can tell you exactly which dataset version was used, which hyperparameters were active, or how to reproduce the result. The model is live, but it is completely irreproducible. This scenario is the norm, not the exception, in ML teams that have not adopted structured versioning practices.
Versioning in MLOps goes beyond source code: it covers datasets, trained models, pipelines, and configurations. Git handles source code beautifully, but it was never designed for binary files weighing hundreds of gigabytes. That is precisely where DVC (Data Version Control) comes in - the open-source tool that brings Git-like versioning principles to the world of data and machine learning models.
In November 2025, lakeFS acquired DVC, confirming the tool's centrality in the MLOps ecosystem while committing to keeping it 100% open-source. This article gives you a deep dive into DVC: setup, pipelines, remote storage, Git and MLflow integration, a structural comparison with lakeFS, and battle-tested best practices for teams of any size.
What You Will Learn
- Why versioning is non-negotiable in MLOps and how it differs from code versioning
- Complete DVC setup: initialization, data tracking, .dvc files, and automatic .gitignore management
- DVC pipelines with dvc.yaml: stages, deps, outs, params, and the dvc.lock file
- Remote storage configuration: AWS S3, Google Cloud Storage, and Azure Blob Storage
- DVC Python API: dvc.api.open() and dvc.api.get_url() for programmatic access
- DVC + MLflow integration: linking dataset versions to experiment runs
- Architectural comparison of DVC vs lakeFS: which tool fits which scenario
- Best practices and cost-effective stack for teams with a budget under 5K EUR/year
The MLOps and Machine Learning in Production Series
| # | Article | Focus |
|---|---|---|
| 1 | MLOps 101: From Experiment to Production | Foundations and lifecycle |
| 2 | ML Pipelines with CI/CD | GitHub Actions and Docker for ML |
| 3 | You are here - Dataset and Model Versioning | DVC vs lakeFS |
| 4 | Experiment Tracking with MLflow | Tracking, model registry, comparison |
| 5 | Model Drift Detection | Monitoring and automated retraining |
| 6 | Serving with FastAPI + Uvicorn | Production model deployment |
| 7 | Scaling ML on Kubernetes | KubeFlow and Seldon Core |
| 8 | A/B Testing ML Models | Methodology and implementation |
| 9 | ML Governance | Compliance, EU AI Act, ethics |
| 10 | Case Study: Churn Prediction | End-to-end production pipeline |
The Versioning Problem in Machine Learning
In a traditional software project, Git solves virtually all versioning challenges: code is lightweight, text-based, and naturally suited to diff and merge operations. In an ML project, code is only one piece of the puzzle. Reproducing a model requires simultaneously tracking:
- Training and validation datasets: often gigabytes or terabytes of structured or unstructured data
- Model artifacts: .pkl, .pt, .h5 files that can weigh hundreds of megabytes
- Hyperparameters and configurations: learning rate, architecture choices, preprocessing steps
- Metrics and results: accuracy, F1, AUC-ROC for every experiment run
- Environment dependencies: Python version, library pinning, CUDA drivers
Git is not designed for large binary files: commit a 500 MB model once and every future clone must download it forever; retrain it weekly and the history grows by tens of gigabytes within months, making the repository unmanageable. The solution is not to abandon versioning, but to use the right tool for each artifact type.
The Cost of Missing ML Versioning
| Scenario | Cost Without Versioning | DVC Solution |
|---|---|---|
| Degraded production model | No rollback to the previous model | git checkout + dvc checkout |
| Corrupted dataset in pipeline | Full re-download and processing | dvc checkout to previous dataset version |
| Regulatory audit (EU AI Act) | Cannot prove which data trained the model | Dataset hash tracked in .dvc file |
| Team collaboration | Each data scientist uses different data versions | dvc pull synchronizes the entire team |
| Experiment reproduction | Results not replicable after 6 months | git checkout + dvc repro rebuilds everything |
DVC Setup and Configuration
DVC integrates into an existing Git repository as an additional layer. It requires no central server, works locally, and scales naturally to cloud storage as the team grows. Installation is straightforward via pip, with optional extras for each storage backend:
# Core installation
pip install dvc
# With remote storage support (choose your provider)
# (quotes prevent shells like zsh from globbing the brackets)
pip install "dvc[s3]"      # Amazon S3
pip install "dvc[gs]"      # Google Cloud Storage
pip install "dvc[azure]"   # Azure Blob Storage
pip install "dvc[ssh]"     # SSH/SFTP
pip install "dvc[all]"     # All backends
# Verify installation
dvc --version
# DVC 3.x.x
Once installed, initialize DVC inside your Git repository. The command creates the required directory structure and automatically updates .gitignore to exclude data files from Git tracking:
# Initialize Git (if not already done)
git init
git add .
git commit -m "Initial commit"
# Initialize DVC
dvc init
# Structure created by dvc init:
# .dvc/
# ├── config # DVC configuration (tracked by Git)
# ├── .gitignore # Excludes cache and tmp
# ├── cache/ # Local artifact cache (NOT tracked by Git)
# └── tmp/
# Commit DVC configuration files
git add .dvc/ .gitignore
git commit -m "chore: initialize DVC"
Tracking Datasets and Models
The core of DVC is the dvc add command, which works analogously to
git add but for large files. DVC computes the MD5 hash of the file, moves it
to the local cache, and creates a .dvc pointer file (a small text file tracked
by Git) that references the hash:
# Add a dataset to DVC tracking
dvc add data/raw/training_data.csv
# DVC has created:
# - data/raw/training_data.csv.dvc (pointer, tracked by Git)
# - data/raw/.gitignore (excludes the original file from Git)
# Contents of the generated .dvc file:
# outs:
# - md5: a1b2c3d4e5f6...
#   size: 524288000
#   path: training_data.csv
# Add the pointer to Git
git add data/raw/training_data.csv.dvc data/raw/.gitignore
git commit -m "feat: add training dataset v1.0"
git tag "dataset-v1.0"
# Add a trained model
dvc add models/churn_model.pkl
git add models/churn_model.pkl.dvc models/.gitignore
git commit -m "feat: add trained churn model v1 (accuracy=0.94)"
git tag "model-v1.0"
How the .dvc File Works
A .dvc file is a lightweight YAML file containing the MD5 (or SHA-256) hash of
the original file's contents, its size, and relative path. This hash is deterministic: the same
file always produces the same hash, making it possible to verify data integrity at any time.
Git tracks the .dvc file, while the actual data file lives in DVC's local cache
or in remote storage.
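Both the hash and the cache layout are easy to reproduce outside DVC. A minimal sketch in plain Python (the files/md5 cache layout shown assumes DVC 3.x; earlier versions used a flatter scheme):

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the same content hash DVC stores in the .dvc pointer.

    Reads in 1 MB chunks so multi-GB datasets never load fully into memory.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a content hash to its blob location (DVC 3.x: files/md5/<2>/<rest>)."""
    return f"{cache_dir}/files/md5/{md5[:2]}/{md5[2:]}"
```

Comparing `file_md5("data/raw/training_data.csv")` with the md5 field in the pointer file verifies data integrity without invoking DVC at all.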
Daily Workflow with DVC
# Daily DVC workflow
# 1. Pull updated data from remote
git pull
dvc pull
# 2. Modify dataset or train a new model
python scripts/preprocess.py
python scripts/train.py
# 3. Track new artifacts
dvc add data/processed/features.parquet
dvc add models/churn_model_v2.pkl
# 4. Commit code and pointers
git add .
git commit -m "feat: retrain model with new features (F1=0.92)"
# 5. Push data to remote storage
dvc push
# 6. Push code to Git
git push
# ---- On a colleague's machine ----
git pull # Downloads code and .dvc files
dvc pull # Downloads data from remote storage
Remote Storage: S3, GCS, and Azure
Remote storage is the backbone of team collaboration in DVC. Without it, data exists only locally and versioning provides no team-level benefits. DVC supports all major cloud providers and on-premise solutions via SSH or NFS.
# ==================== AWS S3 ====================
# Prerequisites: pip install dvc[s3], AWS credentials configured
# Add S3 as the default remote
dvc remote add -d myremote s3://my-ml-bucket/dvc-storage
# For sensitive credentials, use local config (not tracked by Git)
dvc remote modify --local myremote access_key_id $AWS_ACCESS_KEY_ID
dvc remote modify --local myremote secret_access_key $AWS_SECRET_ACCESS_KEY
dvc remote modify myremote region eu-west-1
# ==================== Google Cloud Storage ====================
# Prerequisites: pip install dvc[gs], gcloud auth configured
dvc remote add -d gcsremote gs://my-ml-bucket/dvc-storage
# With Service Account (for production)
dvc remote modify gcsremote credentialpath /path/to/service-account.json
# ==================== Azure Blob Storage ====================
# Prerequisites: pip install dvc[azure]
dvc remote add -d azureremote azure://mycontainer/dvc-storage
dvc remote modify --local azureremote connection_string $AZURE_CONN_STRING
# ==================== Operations ====================
dvc remote list # Show remote configuration
dvc push # Push all tracked artifacts
dvc pull # Pull all artifacts
dvc status --cloud # Check synchronization status
Credential Security
Never place AWS, GCS, or Azure credentials in the .dvc/config file tracked by
Git. Always use --local to write to .dvc/config.local (automatically
git-ignored), or configure credentials via environment variables or IAM roles in CI/CD systems.
In a Kubernetes setup, use ServiceAccounts with IRSA (IAM Roles for Service Accounts) for
secure S3 access without hardcoded credentials.
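For illustration, here is roughly how that split looks on disk after the commands above: the committed config holds only the remote URL and region, while the git-ignored config.local holds the secrets (all values shown are placeholders):

```ini
# .dvc/config  (committed to Git - no secrets)
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-ml-bucket/dvc-storage
    region = eu-west-1

# .dvc/config.local  (git-ignored - secrets live here)
['remote "myremote"']
    access_key_id = AKIA...
    secret_access_key = <redacted>
```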
Free Remote Storage: DagsHub
For teams with a budget under 5K EUR/year, DagsHub offers free DVC remote storage (up to 10 GB per public repo) with native MLflow integration:
# DagsHub as free remote storage
dvc remote add origin https://dagshub.com/username/my-ml-project.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user username
dvc remote modify origin --local password $DAGSHUB_TOKEN
dvc push # Push data
dvc pull # Pull data
DVC Pipelines: Reproducibility with dvc.yaml
Pipelines are DVC's most powerful feature: they define a complete ML workflow as a series of interconnected stages. DVC tracks dependencies between stages and re-runs only those whose inputs have changed - think of it as an intelligent Makefile for machine learning.
The pipeline is defined in a dvc.yaml file, and its state (hashes of all inputs
and outputs) is recorded in dvc.lock. Running dvc repro compares the
current state against dvc.lock and re-executes only invalidated stages.
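To make this concrete, here is a sketch of what dvc.lock records for a single stage (hashes abbreviated; the fields follow the lock-file schema 2.0 used by DVC 3.x). A stage is re-executed exactly when one of these recorded hashes no longer matches the working tree:

```yaml
schema: '2.0'
stages:
  preprocess:
    cmd: python src/features/preprocess.py
    deps:
      - path: data/raw/transactions.parquet
        md5: 3f2a9c...
        size: 524288000
    params:
      params.yaml:
        features.window_days: 30
    outs:
      - path: data/processed/features.parquet
        md5: 8e1b44...
        size: 98304512
```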
# dvc.yaml - Complete ML pipeline for churn prediction
stages:
  fetch_data:
    cmd: python src/data/fetch_data.py
    deps:
      - src/data/fetch_data.py
    params:
      - data.source_url
      - data.date_range
    outs:
      - data/raw/transactions.parquet
      - data/raw/customers.csv

  preprocess:
    cmd: python src/features/preprocess.py
    deps:
      - src/features/preprocess.py
      - data/raw/transactions.parquet
      - data/raw/customers.csv
    params:
      - features.window_days
      - features.aggregations
    outs:
      - data/processed/features.parquet

  split:
    cmd: python src/data/split.py
    deps:
      - src/data/split.py
      - data/processed/features.parquet
    params:
      - split.train_ratio
      - split.val_ratio
      - split.random_seed
    outs:
      - data/splits/train.parquet
      - data/splits/val.parquet
      - data/splits/test.parquet

  train:
    cmd: python src/models/train.py
    deps:
      - src/models/train.py
      - data/splits/train.parquet
      - data/splits/val.parquet
    params:
      - model.type
      - model.n_estimators
      - model.max_depth
      - model.learning_rate
    outs:
      - models/churn_model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/models/evaluate.py
    deps:
      - src/models/evaluate.py
      - models/churn_model.pkl
      - data/splits/test.parquet
    metrics:
      - metrics/test_metrics.json:
          cache: false
# params.yaml - Centralized configuration
data:
  source_url: "s3://my-data-bucket/raw/2025/"
  date_range: "2024-01-01:2025-01-01"

features:
  window_days: 30
  aggregations: [mean, std, max]

split:
  train_ratio: 0.7
  val_ratio: 0.15
  random_seed: 42

model:
  type: "xgboost"
  n_estimators: 500
  max_depth: 6
  learning_rate: 0.05
  subsample: 0.8
# Pipeline execution and management
dvc repro # Run pipeline (invalidated stages only)
dvc repro --force # Force re-run all stages
dvc repro evaluate # Run a specific stage (plus its upstream dependencies)
dvc dag # Visualize pipeline DAG
dvc metrics show # Display current metrics
dvc metrics diff HEAD~1 # Compare with previous commit
dvc params diff main # Compare params with main branch
dvc diff # Show data diffs
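The delta that dvc metrics diff prints is simple to replicate programmatically, which is handy for custom CI gates. A hedged sketch (metrics_diff is a hypothetical helper, not part of DVC; the metric names mirror the pipeline above):

```python
def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric delta between two runs, like `dvc metrics diff` reports."""
    return {k: round(new[k] - old[k], 4) for k in new if k in old}


# Example: compare current test metrics against a saved baseline
baseline = {"accuracy": 0.9301, "f1_score": 0.89}
current = {"accuracy": 0.9423, "f1_score": 0.92}
print(metrics_diff(baseline, current))
# {'accuracy': 0.0122, 'f1_score': 0.03}
```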
DVC Python API: Programmatic Data Access
The DVC Python API enables direct access to versioned datasets and models from Python code without manually downloading files first. This is particularly useful in inference pipelines or serving systems where you want to dynamically load the correct model version.
import io
import dvc.api
import pandas as pd
import pickle
import json

# ==================== dvc.api.open() ====================
# Streams from remote storage directly - no prior dvc pull needed
with dvc.api.open(
    "data/raw/customers.csv",
    repo="https://github.com/myorg/ml-project",
    rev="dataset-v2.1",
    mode="r"
) as f:
    df = pd.read_csv(f)
    print(f"Dataset loaded: {len(df)} rows")

# Read a Parquet file (binary mode; buffer it first, since the
# Parquet reader needs a seekable file object)
with dvc.api.open(
    "data/processed/features.parquet",
    rev="model-v3.0",
    mode="rb"
) as f:
    features_df = pd.read_parquet(io.BytesIO(f.read()))
# ==================== dvc.api.get_url() ====================
# Gets the direct URL in remote storage for a specific version
url = dvc.api.get_url(
    "models/churn_model.pkl",
    repo="https://github.com/myorg/ml-project",
    rev="model-v2.5"
)
print(f"Model URL: {url}")
# Output: s3://my-ml-bucket/dvc-storage/ab/cd1234...

# ==================== Versioned Model Loading ====================
def load_model(version: str):
    """Load a model from a specific DVC version."""
    with dvc.api.open("models/churn_model.pkl", rev=version, mode="rb") as f:
        model = pickle.load(f)
    with dvc.api.open("models/feature_importance.json", rev=version, mode="r") as f:
        metadata = json.load(f)
    return model, metadata

model, metadata = load_model("model-v3.0")
print(f"Model loaded. Features: {metadata['n_features']}")
DVC + Git Integration: The Complete Workflow
Every Git commit that updates a .dvc or dvc.lock file represents
a fully reproducible snapshot of the entire ML project state: code, data, and models together.
# Feature branch workflow
# 1. Create experiment branch
git checkout -b experiment/xgboost-v2
# 2. Update params.yaml (e.g. learning_rate: 0.01)
# 3. Run pipeline
dvc repro
# 4. Check and compare metrics
dvc metrics show
dvc metrics diff main
# accuracy: 0.9301 -> 0.9423 (+0.0122)
# 5. Commit and push
git add dvc.lock params.yaml metrics/
git commit -m "experiment: XGBoost v2 - accuracy +1.2% (0.9423)"
dvc push
git push origin experiment/xgboost-v2
# 6. After PR merge, tag the production model
git checkout main
git tag -a "model-v2.0" -m "XGBoost v2: accuracy=0.9423"
git push --tags
# ==================== Rollback in Production ====================
# Model v2.0 has a bug - roll back to v1.0
git checkout model-v1.0 # Go back to that commit
dvc checkout # Restore data and models to that state
python src/models/evaluate.py # Verify restored model
DVC + MLflow: Linking Dataset Versions to Experiments
DVC and MLflow complement each other: DVC manages versioning of large binary artifacts, while MLflow tracks parameters, metrics, and experiment runs. Integrating them gives you a complete audit trail linking every MLflow run to a specific dataset version.
import mlflow
import mlflow.sklearn
import dvc.api
import subprocess
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from xgboost import XGBClassifier


def get_dvc_metadata() -> dict:
    """Collect DVC metadata for the current MLflow run."""
    git_rev = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    data_url = dvc.api.get_url("data/splits/train.parquet")
    return {
        "dvc.git_rev": git_rev,
        "dvc.data_url": data_url,
        "dvc.data_version": "v2.1",
    }


def train_with_tracking(params: dict) -> None:
    """Full training run with DVC + MLflow tracking."""
    mlflow.set_experiment("churn-prediction")
    with mlflow.start_run(run_name="xgboost-dvc-integrated") as run:
        # Link MLflow run to data version
        mlflow.log_params(get_dvc_metadata())
        mlflow.log_params(params)

        # Load versioned data
        train_df = pd.read_parquet("data/splits/train.parquet")
        test_df = pd.read_parquet("data/splits/test.parquet")
        X_train = train_df.drop("churn", axis=1)
        y_train = train_df["churn"]
        X_test = test_df.drop("churn", axis=1)
        y_test = test_df["churn"]

        model = XGBClassifier(**params)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]

        metrics = {
            "accuracy": accuracy_score(y_test, y_pred),
            "f1_score": f1_score(y_test, y_pred),
            "auc_roc": roc_auc_score(y_test, y_prob),
        }
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")
        print(f"Run {run.info.run_id} | Accuracy: {metrics['accuracy']:.4f}")
DVC vs lakeFS: Choosing the Right Tool
In November 2025, lakeFS acquired DVC, but the two tools remain distinct with different primary use cases. The right choice depends on data scale, infrastructure complexity, and team requirements. They are not mutually exclusive: enterprise teams often run both.
Architectural Comparison: DVC vs lakeFS
| Dimension | DVC | lakeFS |
|---|---|---|
| Architecture | Client-only, no server required | Client/server, requires lakeFS server |
| Data scale | Datasets up to ~TB | Petabyte-scale, enterprise data lakes |
| Integration | Native Git workflow | S3-compatible API, Spark, Hive, Athena |
| Data branching | Through Git commits + .dvc files | Native branches at object store level |
| Setup time | Minutes (pip install + dvc init) | Hours to days (Docker/Kubernetes) |
| Target users | Data scientists, small ML teams | Data engineering teams, enterprise |
| Cost | Free (pay only for storage) | Open-source + paid enterprise plan |
# lakeFS: Python SDK workflow example
# NOTE: this sketch targets the high-level `lakefs` package (pip install lakefs);
# method and argument names can differ slightly between SDK versions.
import lakefs
from lakefs.client import Client

clt = Client(
    host="https://lakefs.mycompany.com",
    username="access_key",
    password="secret_key",
)

repo = lakefs.Repository("churn-data-lake", client=clt).create(
    storage_namespace="s3://my-data-lake/churn/",
    default_branch="main",
)

# Create experiment branch (zero-copy - milliseconds at any scale)
experiment_branch = repo.branch("experiment/new-features").create(
    source_reference="main"
)

# Upload new data without impacting main branch
# (new_features_df is assumed to be a DataFrame built earlier)
experiment_branch.object("data/features_v2.parquet").upload(
    data=new_features_df.to_parquet()
)

experiment_branch.commit(
    message="feat: add recency features",
    metadata={"model_version": "v3.0"},
)

# Merge after successful validation
experiment_branch.merge_into(repo.branch("main"))

# Instant rollback if needed
repo.branch("main").revert(reference="main~1", parent_number=1)
Which Tool Should You Choose?
Use DVC when:
- You are a data scientist or a small team (1-10 people)
- Your datasets are in the GB to low-TB range
- You want native Git integration without additional infrastructure
- Budget is under 5K EUR/year
Use lakeFS when:
- You manage a petabyte-scale enterprise data lake
- You use Spark, Athena, Presto, or other big data frameworks
- You need data governance and audit trails for GDPR or EU AI Act compliance
- You already have Kubernetes infrastructure for server deployment
Production Best Practices for ML Versioning
Recommended Repository Structure
ml-project/
├── .dvc/
│ ├── config # Remote config (no secrets)
│ └── .gitignore
├── data/
│ ├── raw/ # .dvc pointers + .gitignore
│ ├── processed/ # .dvc pointers
│ └── splits/ # .dvc pointers
├── models/ # .dvc pointers + .gitignore
├── metrics/ # JSON files tracked by Git
├── reports/ # PNG reports (cache: false)
├── src/
│ ├── data/
│ ├── features/
│ └── models/
├── dvc.yaml # Pipeline definition
├── dvc.lock # Pipeline state (ALWAYS commit to Git)
├── params.yaml # Hyperparameters
└── requirements.txt
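A small addition worth making to this layout is a .dvcignore file at the repository root: it uses .gitignore syntax and tells DVC which paths to skip when scanning tracked directories (the patterns below are only examples):

```
# .dvcignore - paths DVC should not scan or track
*.tmp
.ipynb_checkpoints/
notebooks/scratch/
```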
CI/CD with GitHub Actions
# .github/workflows/ml-pipeline.yml
name: ML Pipeline Validation
on:
  pull_request:
    branches: [main]
    paths: ['src/**', 'params.yaml', 'dvc.yaml']

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install "dvc[s3]" -r requirements.txt
      - name: Pull data and reproduce pipeline
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull
          dvc repro
      - name: Compare metrics with main
        run: dvc metrics diff main