Dataset and Model Versioning with DVC: A Production MLOps Guide
Picture this: you train a model that hits 94% accuracy, deploy it to production, and then three months later nobody can tell you exactly which dataset version was used, which hyperparameters were active, or how to reproduce the result. The model is live, but it is completely irreproducible. This scenario is the norm, not the exception, in ML teams that have not adopted structured versioning practices.
Versioning in MLOps goes beyond source code: it covers datasets, trained models, pipelines, and configurations. Git handles source code beautifully, but it was never designed for binary files weighing hundreds of gigabytes. That is precisely where DVC (Data Version Control) comes in - the open-source tool that brings Git-like versioning principles to the world of data and machine learning models.
In November 2025, lakeFS acquired DVC, confirming the tool's centrality in the MLOps ecosystem while committing to keeping it 100% open-source. This article gives you a deep dive into DVC: setup, pipelines, remote storage, Git and MLflow integration, a structural comparison with lakeFS, and battle-tested best practices for teams of any size.
What You Will Learn
- Why versioning is non-negotiable in MLOps and how it differs from code versioning
- Complete DVC setup: initialization, data tracking, .dvc files, and automatic .gitignore management
- DVC pipelines with dvc.yaml: stages, deps, outs, params, and the dvc.lock file
- Remote storage configuration: AWS S3, Google Cloud Storage, and Azure Blob Storage
- DVC Python API: dvc.api.open() and dvc.api.get_url() for programmatic access
- DVC + MLflow integration: linking dataset versions to experiment runs
- Architectural comparison of DVC vs lakeFS: which tool fits which scenario
- Best practices and cost-effective stack for teams with a budget under 5K EUR/year
The MLOps and Machine Learning in Production Series
| # | Article | Focus |
|---|---|---|
| 1 | MLOps 101: From Experiment to Production | Foundations and lifecycle |
| 2 | ML Pipelines with CI/CD | GitHub Actions and Docker for ML |
| 3 | You are here - Dataset and Model Versioning | DVC vs lakeFS |
| 4 | Experiment Tracking with MLflow | Tracking, model registry, comparison |
| 5 | Model Drift Detection | Monitoring and automated retraining |
| 6 | Serving with FastAPI + Uvicorn | Production model deployment |
| 7 | Scaling ML on Kubernetes | KubeFlow and Seldon Core |
| 8 | A/B Testing ML Models | Methodology and implementation |
| 9 | ML Governance | Compliance, EU AI Act, ethics |
| 10 | Case Study: Churn Prediction | End-to-end production pipeline |
The Versioning Problem in Machine Learning
In a traditional software project, Git solves virtually all versioning challenges: code is lightweight, text-based, and naturally suited to diff and merge operations. In an ML project, code is only one piece of the puzzle. Reproducing a model requires simultaneously tracking:
- Training and validation datasets: often gigabytes or terabytes of structured or unstructured data
- Model artifacts: .pkl, .pt, .h5 files that can weigh hundreds of megabytes
- Hyperparameters and configurations: learning rate, architecture choices, preprocessing steps
- Metrics and results: accuracy, F1, AUC-ROC for every experiment run
- Environment dependencies: Python version, library pinning, CUDA drivers
Git is not designed for large binary files: commit a 500 MB model once and every future clone must download it forever; retrain it weekly and the history grows by tens of gigabytes within months, making the repository unmanageable. The solution is not to abandon versioning, but to use the right tool for each artifact type.
The Cost of Missing ML Versioning
| Scenario | Cost Without Versioning | DVC Solution |
|---|---|---|
| Degraded production model | No rollback to the previous model | git checkout + dvc checkout |
| Corrupted dataset in pipeline | Full re-download and processing | dvc checkout to previous dataset version |
| Regulatory audit (EU AI Act) | Cannot prove which data trained the model | Dataset hash tracked in .dvc file |
| Team collaboration | Each data scientist uses different data versions | dvc pull synchronizes the entire team |
| Experiment reproduction | Results not replicable after 6 months | git checkout + dvc repro rebuilds everything |
DVC Setup and Configuration
DVC integrates into an existing Git repository as an additional layer. It requires no central server, works locally, and scales naturally to cloud storage as the team grows. Installation is straightforward via pip, with optional extras for each storage backend:
# Core installation
pip install dvc
# With remote storage support (choose your provider)
# (quotes prevent shells like zsh from globbing the brackets)
pip install "dvc[s3]"      # Amazon S3
pip install "dvc[gs]"      # Google Cloud Storage
pip install "dvc[azure]"   # Azure Blob Storage
pip install "dvc[ssh]"     # SSH/SFTP
pip install "dvc[all]"     # All backends
# Verify installation
dvc --version
# DVC 3.x.x
Once installed, initialize DVC inside your Git repository. The command creates the required directory structure and automatically updates .gitignore to exclude data files from Git tracking:
# Initialize Git (if not already done)
git init
git add .
git commit -m "Initial commit"
# Initialize DVC
dvc init
# Structure created by dvc init:
# .dvc/
# ├── config # DVC configuration (tracked by Git)
# ├── .gitignore # Excludes cache and tmp
# ├── cache/ # Local artifact cache (NOT tracked by Git)
# └── tmp/
# Commit DVC configuration files
git add .dvc/ .gitignore
git commit -m "chore: initialize DVC"
Tracking Datasets and Models
The core of DVC is the dvc add command, which works analogously to
git add but for large files. DVC computes the MD5 hash of the file, moves it
to the local cache, and creates a .dvc pointer file (a small text file tracked
by Git) that references the hash:
# Add a dataset to DVC tracking
dvc add data/raw/training_data.csv
# DVC has created:
# - data/raw/training_data.csv.dvc (pointer, tracked by Git)
# - data/raw/.gitignore (excludes the original file from Git)
# Contents of the generated .dvc file:
# outs:
# - md5: a1b2c3d4e5f6...
#   size: 524288000
#   path: training_data.csv
# Add the pointer to Git
git add data/raw/training_data.csv.dvc data/raw/.gitignore
git commit -m "feat: add training dataset v1.0"
git tag "dataset-v1.0"
# Add a trained model
dvc add models/churn_model.pkl
git add models/churn_model.pkl.dvc models/.gitignore
git commit -m "feat: add trained churn model v1 (accuracy=0.94)"
git tag "model-v1.0"
How the .dvc File Works
A .dvc file is a lightweight YAML file containing the MD5 (or SHA-256) hash of
the original file's contents, its size, and relative path. This hash is deterministic: the same
file always produces the same hash, making it possible to verify data integrity at any time.
Git tracks the .dvc file, while the actual data file lives in DVC's local cache
or in remote storage.
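Both the hash and the cache layout are easy to reproduce outside DVC. A minimal sketch in plain Python (the files/md5 cache layout shown assumes DVC 3.x; earlier versions used a flatter scheme):

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the same content hash DVC stores in the .dvc pointer.

    Reads in 1 MB chunks so multi-GB datasets never load fully into memory.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a content hash to its blob location (DVC 3.x: files/md5/<2>/<rest>)."""
    return f"{cache_dir}/files/md5/{md5[:2]}/{md5[2:]}"
```

Comparing `file_md5("data/raw/training_data.csv")` with the md5 field in the pointer file verifies data integrity without invoking DVC at all.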
Daily Workflow with DVC
# Daily DVC workflow
# 1. Pull updated data from remote
git pull
dvc pull
# 2. Modify dataset or train a new model
python scripts/preprocess.py
python scripts/train.py
# 3. Track new artifacts
dvc add data/processed/features.parquet
dvc add models/churn_model_v2.pkl
# 4. Commit code and pointers
git add .
git commit -m "feat: retrain model with new features (F1=0.92)"
# 5. Push data to remote storage
dvc push
# 6. Push code to Git
git push
# ---- On a colleague's machine ----
git pull # Downloads code and .dvc files
dvc pull # Downloads data from remote storage
Remote Storage: S3, GCS, and Azure
Remote storage is the backbone of team collaboration in DVC. Without it, data exists only locally and versioning provides no team-level benefits. DVC supports all major cloud providers and on-premise solutions via SSH or NFS.
# ==================== AWS S3 ====================
# Prerequisites: pip install dvc[s3], AWS credentials configured
# Add S3 as the default remote
dvc remote add -d myremote s3://my-ml-bucket/dvc-storage
# For sensitive credentials, use local config (not tracked by Git)
dvc remote modify --local myremote access_key_id $AWS_ACCESS_KEY_ID
dvc remote modify --local myremote secret_access_key $AWS_SECRET_ACCESS_KEY
dvc remote modify myremote region eu-west-1
# ==================== Google Cloud Storage ====================
# Prerequisites: pip install dvc[gs], gcloud auth configured
dvc remote add -d gcsremote gs://my-ml-bucket/dvc-storage
# With Service Account (for production)
dvc remote modify gcsremote credentialpath /path/to/service-account.json
# ==================== Azure Blob Storage ====================
# Prerequisites: pip install dvc[azure]
dvc remote add -d azureremote azure://mycontainer/dvc-storage
dvc remote modify --local azureremote connection_string $AZURE_CONN_STRING
# ==================== Operations ====================
dvc remote list # Show remote configuration
dvc push # Push all tracked artifacts
dvc pull # Pull all artifacts
dvc status --cloud # Check synchronization status
Credential Security
Never place AWS, GCS, or Azure credentials in the .dvc/config file tracked by
Git. Always use --local to write to .dvc/config.local (automatically
git-ignored), or configure credentials via environment variables or IAM roles in CI/CD systems.
In a Kubernetes setup, use ServiceAccounts with IRSA (IAM Roles for Service Accounts) for
secure S3 access without hardcoded credentials.
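For illustration, here is roughly how that split looks on disk after the commands above: the committed config holds only the remote URL and region, while the git-ignored config.local holds the secrets (all values shown are placeholders):

```ini
# .dvc/config  (committed to Git - no secrets)
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-ml-bucket/dvc-storage
    region = eu-west-1

# .dvc/config.local  (git-ignored - secrets live here)
['remote "myremote"']
    access_key_id = AKIA...
    secret_access_key = <redacted>
```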
Free Remote Storage: DagsHub
For teams with a budget under 5K EUR/year, DagsHub offers free DVC remote storage (up to 10 GB per public repo) with native MLflow integration:
# DagsHub as free remote storage
dvc remote add origin https://dagshub.com/username/my-ml-project.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user username
dvc remote modify origin --local password $DAGSHUB_TOKEN
dvc push # Push data
dvc pull # Pull data
DVC Pipelines: Reproducibility with dvc.yaml
Pipelines are DVC's most powerful feature: they define a complete ML workflow as a series of interconnected stages. DVC tracks dependencies between stages and re-runs only those whose inputs have changed - think of it as an intelligent Makefile for machine learning.
The pipeline is defined in a dvc.yaml file, and its state (hashes of all inputs
and outputs) is recorded in dvc.lock. Running dvc repro compares the
current state against dvc.lock and re-executes only invalidated stages.
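To make this concrete, here is a sketch of what dvc.lock records for a single stage (hashes abbreviated; the fields follow the lock-file schema 2.0 used by DVC 3.x). A stage is re-executed exactly when one of these recorded hashes no longer matches the working tree:

```yaml
schema: '2.0'
stages:
  preprocess:
    cmd: python src/features/preprocess.py
    deps:
      - path: data/raw/transactions.parquet
        md5: 3f2a9c...
        size: 524288000
    params:
      params.yaml:
        features.window_days: 30
    outs:
      - path: data/processed/features.parquet
        md5: 8e1b44...
        size: 98304512
```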
# dvc.yaml - Complete ML pipeline for churn prediction
stages:
  fetch_data:
    cmd: python src/data/fetch_data.py
    deps:
      - src/data/fetch_data.py
    params:
      - data.source_url
      - data.date_range
    outs:
      - data/raw/transactions.parquet
      - data/raw/customers.csv

  preprocess:
    cmd: python src/features/preprocess.py
    deps:
      - src/features/preprocess.py
      - data/raw/transactions.parquet
      - data/raw/customers.csv
    params:
      - features.window_days
      - features.aggregations
    outs:
      - data/processed/features.parquet

  split:
    cmd: python src/data/split.py
    deps:
      - src/data/split.py
      - data/processed/features.parquet
    params:
      - split.train_ratio
      - split.val_ratio
      - split.random_seed
    outs:
      - data/splits/train.parquet
      - data/splits/val.parquet
      - data/splits/test.parquet

  train:
    cmd: python src/models/train.py
    deps:
      - src/models/train.py
      - data/splits/train.parquet
      - data/splits/val.parquet
    params:
      - model.type
      - model.n_estimators
      - model.max_depth
      - model.learning_rate
    outs:
      - models/churn_model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/models/evaluate.py
    deps:
      - src/models/evaluate.py
      - models/churn_model.pkl
      - data/splits/test.parquet
    metrics:
      - metrics/test_metrics.json:
          cache: false
# params.yaml - Centralized configuration
data:
  source_url: "s3://my-data-bucket/raw/2025/"
  date_range: "2024-01-01:2025-01-01"

features:
  window_days: 30
  aggregations: [mean, std, max]

split:
  train_ratio: 0.7
  val_ratio: 0.15
  random_seed: 42

model:
  type: "xgboost"
  n_estimators: 500
  max_depth: 6
  learning_rate: 0.05
  subsample: 0.8
# Pipeline execution and management
dvc repro # Run pipeline (invalidated stages only)
dvc repro --force # Force re-run all stages
dvc repro evaluate # Run a specific stage (plus its upstream dependencies)
dvc dag # Visualize pipeline DAG
dvc metrics show # Display current metrics
dvc metrics diff HEAD~1 # Compare with previous commit
dvc params diff main # Compare params with main branch
dvc diff # Show data diffs
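The delta that dvc metrics diff prints is simple to replicate programmatically, which is handy for custom CI gates. A hedged sketch (metrics_diff is a hypothetical helper, not part of DVC; the metric names mirror the pipeline above):

```python
def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric delta between two runs, like `dvc metrics diff` reports."""
    return {k: round(new[k] - old[k], 4) for k in new if k in old}


# Example: compare current test metrics against a saved baseline
baseline = {"accuracy": 0.9301, "f1_score": 0.89}
current = {"accuracy": 0.9423, "f1_score": 0.92}
print(metrics_diff(baseline, current))
# {'accuracy': 0.0122, 'f1_score': 0.03}
```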
DVC Python API: Programmatic Data Access
The DVC Python API enables direct access to versioned datasets and models from Python code without manually downloading files first. This is particularly useful in inference pipelines or serving systems where you want to dynamically load the correct model version.
import io
import dvc.api
import pandas as pd
import pickle
import json

# ==================== dvc.api.open() ====================
# Streams from remote storage directly - no prior dvc pull needed
with dvc.api.open(
    "data/raw/customers.csv",
    repo="https://github.com/myorg/ml-project",
    rev="dataset-v2.1",
    mode="r"
) as f:
    df = pd.read_csv(f)
    print(f"Dataset loaded: {len(df)} rows")

# Read a Parquet file (binary mode; buffer it first, since the
# Parquet reader needs a seekable file object)
with dvc.api.open(
    "data/processed/features.parquet",
    rev="model-v3.0",
    mode="rb"
) as f:
    features_df = pd.read_parquet(io.BytesIO(f.read()))
# ==================== dvc.api.get_url() ====================
# Gets the direct URL in remote storage for a specific version
url = dvc.api.get_url(
    "models/churn_model.pkl",
    repo="https://github.com/myorg/ml-project",
    rev="model-v2.5"
)
print(f"Model URL: {url}")
# Output: s3://my-ml-bucket/dvc-storage/ab/cd1234...

# ==================== Versioned Model Loading ====================
def load_model(version: str):
    """Load a model from a specific DVC version."""
    with dvc.api.open("models/churn_model.pkl", rev=version, mode="rb") as f:
        model = pickle.load(f)
    with dvc.api.open("models/feature_importance.json", rev=version, mode="r") as f:
        metadata = json.load(f)
    return model, metadata

model, metadata = load_model("model-v3.0")
print(f"Model loaded. Features: {metadata['n_features']}")
DVC + Git Integration: The Complete Workflow
Every Git commit that updates a .dvc or dvc.lock file represents
a fully reproducible snapshot of the entire ML project state: code, data, and models together.
# Feature branch workflow
# 1. Create experiment branch
git checkout -b experiment/xgboost-v2
# 2. Update params.yaml (e.g. learning_rate: 0.01)
# 3. Run pipeline
dvc repro
# 4. Check and compare metrics
dvc metrics show
dvc metrics diff main
# accuracy: 0.9301 -> 0.9423 (+0.0122)
# 5. Commit and push
git add dvc.lock params.yaml metrics/
git commit -m "experiment: XGBoost v2 - accuracy +1.2% (0.9423)"
dvc push
git push origin experiment/xgboost-v2
# 6. After PR merge, tag the production model
git checkout main
git tag -a "model-v2.0" -m "XGBoost v2: accuracy=0.9423"
git push --tags
# ==================== Rollback in Production ====================
# Model v2.0 has a bug - roll back to v1.0
git checkout model-v1.0 # Go back to that commit
dvc checkout # Restore data and models to that state
python src/models/evaluate.py # Verify restored model
DVC + MLflow: Linking Dataset Versions to Experiments
DVC and MLflow complement each other: DVC manages versioning of large binary artifacts, while MLflow tracks parameters, metrics, and experiment runs. Integrating them gives you a complete audit trail linking every MLflow run to a specific dataset version.
import mlflow
import mlflow.sklearn
import dvc.api
import subprocess
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from xgboost import XGBClassifier


def get_dvc_metadata() -> dict:
    """Collect DVC metadata for the current MLflow run."""
    git_rev = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    data_url = dvc.api.get_url("data/splits/train.parquet")
    return {
        "dvc.git_rev": git_rev,
        "dvc.data_url": data_url,
        "dvc.data_version": "v2.1",
    }


def train_with_tracking(params: dict) -> None:
    """Full training run with DVC + MLflow tracking."""
    mlflow.set_experiment("churn-prediction")
    with mlflow.start_run(run_name="xgboost-dvc-integrated") as run:
        # Link MLflow run to data version
        mlflow.log_params(get_dvc_metadata())
        mlflow.log_params(params)

        # Load versioned data
        train_df = pd.read_parquet("data/splits/train.parquet")
        test_df = pd.read_parquet("data/splits/test.parquet")
        X_train = train_df.drop("churn", axis=1)
        y_train = train_df["churn"]
        X_test = test_df.drop("churn", axis=1)
        y_test = test_df["churn"]

        model = XGBClassifier(**params)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]

        metrics = {
            "accuracy": accuracy_score(y_test, y_pred),
            "f1_score": f1_score(y_test, y_pred),
            "auc_roc": roc_auc_score(y_test, y_prob),
        }
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")
        print(f"Run {run.info.run_id} | Accuracy: {metrics['accuracy']:.4f}")
DVC vs lakeFS: Choosing the Right Tool
In November 2025, lakeFS acquired DVC, but the two tools remain distinct with different primary use cases. The right choice depends on data scale, infrastructure complexity, and team requirements. They are not mutually exclusive: enterprise teams often run both.
Architectural Comparison: DVC vs lakeFS
| Dimension | DVC | lakeFS |
|---|---|---|
| Architecture | Client-only, no server required | Client/server, requires lakeFS server |
| Data scale | Datasets up to ~TB | Petabyte-scale, enterprise data lakes |
| Integration | Native Git workflow | S3-compatible API, Spark, Hive, Athena |
| Data branching | Through Git commits + .dvc files | Native branches at object store level |
| Setup time | Minutes (pip install + dvc init) | Hours to days (Docker/Kubernetes) |
| Target users | Data scientists, small ML teams | Data engineering teams, enterprise |
| Cost | Free (pay only for storage) | Open-source + paid enterprise plan |
# lakeFS: Python SDK workflow example
# NOTE: this sketch targets the high-level `lakefs` package (pip install lakefs);
# method and argument names can differ slightly between SDK versions.
import lakefs
from lakefs.client import Client

clt = Client(
    host="https://lakefs.mycompany.com",
    username="access_key",
    password="secret_key",
)

repo = lakefs.Repository("churn-data-lake", client=clt).create(
    storage_namespace="s3://my-data-lake/churn/",
    default_branch="main",
)

# Create experiment branch (zero-copy - milliseconds at any scale)
experiment_branch = repo.branch("experiment/new-features").create(
    source_reference="main"
)

# Upload new data without impacting main branch
# (new_features_df is assumed to be a DataFrame built earlier)
experiment_branch.object("data/features_v2.parquet").upload(
    data=new_features_df.to_parquet()
)

experiment_branch.commit(
    message="feat: add recency features",
    metadata={"model_version": "v3.0"},
)

# Merge after successful validation
experiment_branch.merge_into(repo.branch("main"))

# Instant rollback if needed
repo.branch("main").revert(reference="main~1", parent_number=1)
Which Tool Should You Choose?
Use DVC when:
- You are a data scientist or a small team (1-10 people)
- Your datasets are in the GB to low-TB range
- You want native Git integration without additional infrastructure
- Budget is under 5K EUR/year
Use lakeFS when:
- You manage a petabyte-scale enterprise data lake
- You use Spark, Athena, Presto, or other big data frameworks
- You need data governance and audit trails for GDPR or EU AI Act compliance
- You already have Kubernetes infrastructure for server deployment
Production Best Practices for ML Versioning
Recommended Repository Structure
ml-project/
├── .dvc/
│ ├── config # Remote config (no secrets)
│ └── .gitignore
├── data/
│ ├── raw/ # .dvc pointers + .gitignore
│ ├── processed/ # .dvc pointers
│ └── splits/ # .dvc pointers
├── models/ # .dvc pointers + .gitignore
├── metrics/ # JSON files tracked by Git
├── reports/ # PNG reports (cache: false)
├── src/
│ ├── data/
│ ├── features/
│ └── models/
├── dvc.yaml # Pipeline definition
├── dvc.lock # Pipeline state (ALWAYS commit to Git)
├── params.yaml # Hyperparameters
└── requirements.txt
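A small addition worth making to this layout is a .dvcignore file at the repository root: it uses .gitignore syntax and tells DVC which paths to skip when scanning tracked directories (the patterns below are only examples):

```
# .dvcignore - paths DVC should not scan or track
*.tmp
.ipynb_checkpoints/
notebooks/scratch/
```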
CI/CD with GitHub Actions
# .github/workflows/ml-pipeline.yml
name: ML Pipeline Validation
on:
  pull_request:
    branches: [main]
    paths: ['src/**', 'params.yaml', 'dvc.yaml']

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install "dvc[s3]" -r requirements.txt
      - name: Pull data and reproduce pipeline
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull
          dvc repro
      - name: Compare metrics with main
        run: dvc metrics diff main