Fine-tuning NLP Models Locally: Adapting BERT to Your Domain
Pre-trained models like BERT are extremely powerful, but they are trained on generic data. For real applications — legal contract analysis, medical record classification, sentiment on domain-specific reviews, NER on technical texts — domain-specific fine-tuning makes the difference between a mediocre model and an excellent one.
In this article we explore the main techniques for adapting BERT (and LLMs) to your domain: from domain-adaptive pre-training to LoRA fine-tuning on consumer GPUs, and from managing annotated data to strategies for maximizing quality with few examples. We include practical code examples and real-world use cases.
This is the eighth article in the Modern NLP: from BERT to LLMs series, classified as Advanced. It assumes familiarity with BERT and the HuggingFace ecosystem (articles 2 and 7).
What You Will Learn
- Fine-tuning strategies: from scratch, partial, full, adapter — systematic comparison
- Domain-Adaptive Pre-training (DAPT) for domain adaptation
- LoRA mathematics: low-rank decomposition and geometric intuition
- Practical LoRA: implementation with PEFT library for classification
- QLoRA: LoRA with 4-bit quantization on consumer GPU (8-16GB)
- Fine-tuning LLMs (LLaMA, Mistral) with TRL and SFTTrainer
- Managing small datasets (<1000 examples): techniques to maximize performance
- NLP data augmentation: back-translation, EDA, synonym replacement
- Techniques to avoid catastrophic forgetting (EWC, gradual unfreezing)
- Post fine-tuning evaluation: domain-specific benchmarks and error analysis
- Model versioning and deployment management
1. Fine-tuning Strategies: A Comparison
There is no single optimal fine-tuning strategy. The choice depends on computational resources, available data quantity, base model size, and performance requirements. The table below provides a practical decision framework.
Fine-tuning Approaches: Decision Guide
| Strategy | Trained Parameters | GPU Required | Data Needed | Pros | Cons |
|---|---|---|---|---|---|
| Full fine-tuning | 100% (all) | 16-80GB | 10K+ | Maximum accuracy, highest adaptability | Expensive, catastrophic forgetting risk, high storage |
| Partial (last N layers) | 10-30% | 8-16GB | 1K+ | Faster, less catastrophic forgetting | Less flexible, suboptimal on large distribution shifts |
| LoRA (r=8-32) | 0.1-1% | 8-16GB | 100+ | Best trade-off, small adapter, no catastrophic forgetting | Slight overhead at runtime if not merged |
| QLoRA (4-bit) | 0.1-1% | 6-12GB | 100+ | Large LLMs on consumer GPU, minimal costs | Slightly slower, requires bitsandbytes |
| Adapter layers | 1-5% | 8-16GB | 500+ | Multi-task with one base model, modular | Extra latency, more complex architecture |
| Prompt tuning | <0.1% | 8GB | 500+ | Minimal storage, no weight modification | Lower performance on small datasets |
| SetFit (sentence-transformers) | 100% SBERT | 4-8GB | 8-64 (few-shot!) | Excellent with very few data points | Classification only, no generation |
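As a quick sanity check, the table can be condensed into a toy helper. This is purely illustrative: the function name and thresholds below are our own simplification of the table, not any library API.

```python
def suggest_strategy(gpu_gb: float, num_examples: int, need_generation: bool = False) -> str:
    """Toy heuristic encoding a few rows of the decision table above."""
    if not need_generation and num_examples < 100:
        return "SetFit"          # few-shot classification, 8-64 examples
    if gpu_gb < 8:
        return "QLoRA (4-bit)"   # large models on 6-12GB GPUs
    if num_examples >= 10_000 and gpu_gb >= 16:
        return "Full fine-tuning"
    return "LoRA (r=8-32)"       # best trade-off in most cases

print(suggest_strategy(gpu_gb=8, num_examples=500))    # LoRA (r=8-32)
print(suggest_strategy(gpu_gb=6, num_examples=2000))   # QLoRA (4-bit)
print(suggest_strategy(gpu_gb=8, num_examples=30))     # SetFit
```

In practice the boundaries are soft; treat the helper as a mnemonic for the table, not a rule.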
2. Domain-Adaptive Pre-training (DAPT)
Before task-specific fine-tuning, it is often useful to continue pre-training the model on target-domain text (no labels required) with the masked language modeling (MLM) objective. This helps the model acquire the vocabulary and patterns of the specific domain. Research (Gururangan et al., 2020, "Don't Stop Pretraining") shows that DAPT yields consistent gains, often in the 5-15% range on technical domains.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForWholeWordMask,
    TrainingArguments,
    Trainer
)
from datasets import Dataset
import torch
# Base model to adapt
BASE_MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)
# Domain corpus (e.g., medical/legal texts — no labels needed)
domain_texts = [
"The patient presents with symptoms of congestive heart failure...",
"Pursuant to the applicable regulations, the contracting parties agree...",
"The differential diagnosis includes neoplastic and inflammatory pathologies...",
"Histological examination reveals the presence of atypical cells at...",
# ... thousands of domain texts
]
def tokenize_corpus(examples, chunk_size=512):
"""Tokenize and split into chunks of max 512 tokens."""
tokenized = tokenizer(
examples["text"],
truncation=False,
return_special_tokens_mask=True
)
all_input_ids, all_attention_masks, all_special_tokens_masks = [], [], []
for ids, attn, stm in zip(
tokenized["input_ids"],
tokenized["attention_mask"],
tokenized["special_tokens_mask"]
):
for i in range(0, len(ids), chunk_size):
chunk = ids[i:i+chunk_size]
if len(chunk) >= 64:
padded = chunk + [tokenizer.pad_token_id] * (chunk_size - len(chunk))
attn_chunk = [1] * len(chunk) + [0] * (chunk_size - len(chunk))
stm_chunk = stm[i:i+chunk_size] + [1] * (chunk_size - len(chunk))
all_input_ids.append(padded)
all_attention_masks.append(attn_chunk)
all_special_tokens_masks.append(stm_chunk)
return {
"input_ids": all_input_ids,
"attention_mask": all_attention_masks,
"special_tokens_mask": all_special_tokens_masks
}
domain_dataset = Dataset.from_dict({"text": domain_texts})
tokenized_corpus = domain_dataset.map(tokenize_corpus, batched=True, remove_columns=["text"])
# Whole Word Masking collator (more effective than standard token masking)
data_collator_wwm = DataCollatorForWholeWordMask(
tokenizer=tokenizer,
mlm=True,
mlm_probability=0.15
)
# DAPT training configuration
dapt_args = TrainingArguments(
output_dir="./models/bert-domain-dapt",
num_train_epochs=5,
per_device_train_batch_size=16,
learning_rate=5e-5, # higher LR for DAPT vs fine-tuning
warmup_ratio=0.05,
weight_decay=0.01,
save_steps=500,
save_total_limit=2,
fp16=True,
report_to="none",
logging_steps=100,
)
dapt_trainer = Trainer(
model=model,
args=dapt_args,
train_dataset=tokenized_corpus,
data_collator=data_collator_wwm
)
print("Starting DAPT training...")
dapt_trainer.train()
model.save_pretrained("./models/bert-domain-dapt")
tokenizer.save_pretrained("./models/bert-domain-dapt")
print("DAPT complete. The model has acquired domain-specific vocabulary.")
3. LoRA: Mathematics and Implementation
LoRA (Low-Rank Adaptation) is based on the observation that during fine-tuning, the weight updates of pre-trained models have a low intrinsic rank. Instead of modifying W ∈ R^(d x k) directly, LoRA parameterizes the update as ΔW = BA, where B ∈ R^(d x r) and A ∈ R^(r x k) with r ≪ min(d, k). The effective update is scaled by α/r, a hyperparameter ratio controlling how strongly the adapter influences the frozen weights.
With r=8 applied to the query and value projections, BERT-base drops from 110M trainable parameters to approximately 300K (0.27%). With r=16 this rises to ~600K (0.54%), usually with better performance. The trade-off: a higher rank means more parameters and more memory, and typically (though not always) better quality.
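The decomposition is easy to verify numerically. A minimal numpy sketch follows; the dimensions mirror a single BERT-base attention projection, and the values are random placeholders, not real weights.

```python
import numpy as np

d, k, r = 768, 768, 8            # BERT-base projection size, LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))  # frozen pretrained weight

# B starts at zero and A random, so delta_W = B @ A is zero before training
B = np.zeros((d, r))
A = rng.standard_normal((r, k))
alpha = 16                       # effective update scaled by alpha / r

delta_W = (alpha / r) * (B @ A)
W_adapted = W + delta_W          # what the forward pass effectively uses

full_params = d * k              # 589,824 if W were trained directly
lora_params = r * (d + k)        # 12,288 trainable LoRA parameters
print(f"per-matrix reduction: {lora_params:,} / {full_params:,} "
      f"= {lora_params / full_params:.2%}")
# 12 layers x 2 matrices (query, value) x 12,288 ≈ 295K -> the ~300K above
```

Because B is zero-initialized, the adapted model is exactly the pretrained model at step 0; training only ever moves it through a rank-r subspace of weight updates.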
How to Choose the LoRA Rank r
| Rank r | Trainable Parameters | Extra Memory | When to Use |
|---|---|---|---|
| r=4 | ~0.1% | Minimal | Simple tasks, much data, ultra-light deployment |
| r=8 | ~0.25% | Low | Good default for most tasks |
| r=16 | ~0.5% | Medium | Complex tasks, recommended best practice |
| r=32 | ~1% | Medium-high | Very complex tasks, large distribution shifts |
| r=64 | ~2% | High | Near-equivalent to full fine-tuning in some cases |
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import evaluate
import numpy as np
MODEL = "./models/bert-domain-dapt" # or "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL,
num_labels=4,
id2label={0: "employment", 1: "sale", 2: "lease", 3: "service"},
label2id={"employment": 0, "sale": 1, "lease": 2, "service": 3}
)
# Optimized LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
r=16, # optimal rank for classification tasks
lora_alpha=32, # scaling = lora_alpha / r = 2.0
target_modules=[ # layers to modify in BERT
"query", # query projection in multi-head attention
"key", # key projection
"value", # value projection
"dense" # dense layer in attention output and FFN
],
lora_dropout=0.05,
bias="none",
modules_to_save=["classifier"] # classification head always fully trained
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 592,898 || all params: 124,647,170 || trainable%: 0.4756%
# Verify trainable layers
print("\nTrainable layers:")
for name, param in peft_model.named_parameters():
if param.requires_grad:
print(f" {name}: {param.shape}")
# Training dataset
train_data = {
"text": [
"The employer agrees to pay the employee a monthly salary of...",
"The parties agree to the sale of the property located at...",
"The landlord grants the tenant a lease of the apartment for...",
"The consultant shall provide IT advisory services for a period of...",
],
"label": [0, 1, 2, 3]
}
def tokenize_fn(examples):
return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)
train_ds = Dataset.from_dict(train_data).map(tokenize_fn, batched=True, remove_columns=["text"])
train_ds.set_format("torch")
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
return {
"accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
"f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
}
# LoRA training: higher LR and more epochs vs full fine-tuning
args = TrainingArguments(
output_dir="./results/bert-legal-lora",
num_train_epochs=20, # more epochs for small datasets
per_device_train_batch_size=8,
learning_rate=3e-4, # LoRA uses higher LR (3e-4 vs 2e-5)
warmup_ratio=0.1,
weight_decay=0.01,
    eval_strategy="no",  # set to "epoch" and pass eval_dataset once a validation split exists
save_strategy="epoch",
save_total_limit=2,
fp16=True,
report_to="none",
seed=42
)
trainer = Trainer(
model=peft_model,
args=args,
train_dataset=train_ds,
compute_metrics=compute_metrics
)
trainer.train()
peft_model.save_pretrained("./models/bert-lora-adapter")
print("\nLoRA adapter saved (~2MB instead of ~500MB!)")
4. QLoRA: Fine-tuning LLMs on Consumer GPU
QLoRA (Dettmers et al., 2023) combines 4-bit quantization with LoRA, enabling fine-tuning of very large models (7B-70B parameters) on consumer GPUs with 6-24GB of VRAM. The paper's Guanaco-65B model, a LLaMA-65B fine-tuned with QLoRA, reaches roughly 99% of ChatGPT's performance on the Vicuna benchmark.
VRAM Requirements for QLoRA on Common Models
| Model | Parameters | FP16 | INT8 | NF4 (QLoRA) | Minimum GPU |
|---|---|---|---|---|---|
| Mistral-7B | 7B | ~14GB | ~8GB | ~5GB | RTX 3070 (8GB) |
| Llama-2-13B | 13B | ~26GB | ~14GB | ~9GB | RTX 3090 (24GB)* |
| Llama-2-70B | 70B | ~140GB | ~70GB | ~40GB | A100 80GB or 2x A40 |
| BERT-base | 110M | ~0.4GB | ~0.2GB | ~0.1GB | CPU or any GPU |
*With gradient checkpointing and batch size 1
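The table's FP16 column follows directly from a back-of-envelope rule: weight memory ≈ parameters × bits per parameter / 8. The sketch below covers weights only; activations, gradients, and CUDA overhead come on top, which is why the NF4 column above sits somewhat higher than the raw estimate.

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight-only VRAM estimate: params * bits / 8 bytes, in GB."""
    return params_billion * bits_per_param / 8

for name, b in [("Mistral-7B", 7), ("Llama-2-13B", 13), ("Llama-2-70B", 70)]:
    fp16 = estimate_weight_vram_gb(b, 16)
    nf4 = estimate_weight_vram_gb(b, 4.5)  # ~4 bits + quantization constants
    print(f"{name}: FP16 ≈ {fp16:.0f}GB, NF4 ≈ {nf4:.1f}GB (+ runtime overhead)")
```

The 4.5 bits/parameter figure is an approximation for NF4 with double quantization; the exact footprint depends on block size and which layers stay unquantized.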
# pip install bitsandbytes accelerate peft trl transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
import torch
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 (optimal for LLMs)
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # saves ~0.4 bits per parameter
)
# Load model in 4-bit: Mistral-7B from ~14GB to ~5GB VRAM!
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
attn_implementation="flash_attention_2" # Flash Attention 2 if available
)
print(f"GPU memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# LoRA config for LLM (attention + MLP layers)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", # attention layers
"o_proj", # output projection
"gate_proj", "up_proj", "down_proj" # MLP SwiGLU layers
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: ~42M for r=16 on these modules (~0.6% of the 7B base;
# the exact printout varies with the PEFT version and how 4-bit weights are counted)
# Dataset in instruction-following format
def format_instruction(instruction: str, input_text: str, output: str) -> str:
if input_text:
return (
f"### Instruction:\n{instruction}\n\n"
f"### Input:\n{input_text}\n\n"
f"### Response:\n{output}"
)
return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
train_examples = [
{
"text": format_instruction(
instruction="Classify this text into the appropriate category.",
input_text="The employer agrees to pay the employee a monthly salary...",
output="employment"
)
},
{
"text": format_instruction(
instruction="Extract the contracting parties from this document.",
input_text="Between John Smith (seller) and Jane Doe (buyer), it is agreed that...",
output="Seller: John Smith\nBuyer: Jane Doe"
)
},
]
train_dataset = Dataset.from_list(train_examples)
# SFTTrainer for supervised fine-tuning
sft_config = SFTConfig(
output_dir="./models/mistral-qlora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # effective batch = 4*4 = 16
warmup_ratio=0.1,
learning_rate=2e-4,
bf16=True, # bfloat16 more stable than fp16 for LLMs
logging_steps=10,
    optim="paged_adamw_32bit",  # paged optimizer states: avoids OOM spikes during updates
lr_scheduler_type="cosine",
max_seq_length=512,
dataset_text_field="text",
packing=True, # pack short examples for efficiency
report_to="none",
)
trainer = SFTTrainer(model=peft_model, train_dataset=train_dataset, args=sft_config)
trainer.train()
trainer.save_model("./models/mistral-qlora")
print("QLoRA fine-tuning complete!")
5. Managing Small Datasets
In many real-world scenarios, annotated data is scarce. Here are the most effective strategies to maximize quality with few examples.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import random
# =========================================
# Strategy 1: SetFit for few-shot learning (2-64 examples!)
# =========================================
from setfit import SetFitModel, SetFitTrainer
setfit_model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# SetFit works with as few as 8 examples per class; only 2 per class shown here
train_data = {
"text": ["Great product, highly recommend!",
"Poor quality, not worth the price.",
"Absolutely amazing, exceeded expectations!",
"Terrible experience, waste of money."],
"label": [1, 0, 1, 0]
}
from datasets import Dataset
setfit_trainer = SetFitTrainer(
model=setfit_model,
train_dataset=Dataset.from_dict(train_data),
num_iterations=20,
num_epochs=1,
batch_size=16,
)
setfit_trainer.train()
# =========================================
# Strategy 2: Gradual unfreezing
# =========================================
def progressive_unfreeze(model, epoch: int, total_epochs: int, num_layers: int = 12):
"""
Gradual unfreezing: unlock layers from last to first as training progresses.
Prevents catastrophic forgetting and improves performance with little data.
"""
layers_to_unfreeze = max(1, int(num_layers * epoch / total_epochs))
first_layer_to_unfreeze = num_layers - layers_to_unfreeze
for i, layer in enumerate(model.bert.encoder.layer):
frozen = (i < first_layer_to_unfreeze)
for param in layer.parameters():
param.requires_grad = not frozen
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f" Epoch {epoch}: unlocked layers {first_layer_to_unfreeze}-{num_layers-1}, "
f"trainable: {trainable:,}")
# =========================================
# Strategy 3: Layer-wise learning rates
# =========================================
from torch.optim import AdamW
def get_layerwise_lr(model, base_lr: float = 2e-5, lr_decay: float = 0.75) -> list:
"""
Decreasing learning rate for lower layers.
Lower layers (syntax, basic semantics) change little;
higher layers (task-specific features) change more.
"""
params = [{"params": model.bert.embeddings.parameters(), "lr": base_lr * (lr_decay ** 13)}]
for i, layer in enumerate(model.bert.encoder.layer):
lr = base_lr * (lr_decay ** (12 - i))
params.append({"params": layer.parameters(), "lr": lr})
params.append({"params": model.bert.pooler.parameters(), "lr": base_lr})
params.append({"params": model.classifier.parameters(), "lr": base_lr * 10}) # 10x for head
return params
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
optimizer = AdamW(get_layerwise_lr(model, base_lr=2e-5, lr_decay=0.75))
# =========================================
# Strategy 4: Data Augmentation
# =========================================
def easy_data_augmentation(text: str, num_aug: int = 4, p_delete: float = 0.1) -> list:
    """Easy Data Augmentation (EDA): random swap + random deletion.
    Full EDA also includes synonym replacement and random insertion,
    which need a thesaurus (e.g., WordNet)."""
    words = text.split()
    augmented = []
    for _ in range(num_aug):
        new_words = words.copy()
        # Random Swap: exchange two random positions
        if len(new_words) >= 2:
            i, j = random.sample(range(len(new_words)), 2)
            new_words[i], new_words[j] = new_words[j], new_words[i]
        # Random Deletion: drop each word with probability p_delete
        kept = [w for w in new_words if random.random() > p_delete]
        augmented.append(" ".join(kept if kept else new_words))
    return augmented
# Back-translation for stronger semantic-preserving augmentation
from transformers import pipeline as hf_pipeline
def back_translate(text: str, src_to_pivot, pivot_to_src) -> str:
"""Back-translation: src -> pivot language -> src (semantically similar variant)."""
pivoted = src_to_pivot(text, max_length=512)[0]['translation_text']
back = pivot_to_src(pivoted, max_length=512)[0]['translation_text']
return back
# Example setup (requires translation models):
# en_to_de = hf_pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
# de_to_en = hf_pipeline("translation_de_to_en", model="Helsinki-NLP/opus-mt-de-en")
# augmented = back_translate("Great product!", en_to_de, de_to_en)
print("Data augmentation strategies configured!")
6. Avoiding Catastrophic Forgetting
A common risk in fine-tuning is catastrophic forgetting: the model "forgets" general knowledge acquired during pre-training while learning the specific task. Here is how to mitigate it with Elastic Weight Consolidation and other techniques.
import torch
from torch import nn
from typing import Dict, Iterator
class EWC:
"""
Elastic Weight Consolidation to prevent catastrophic forgetting.
Penalizes large changes to parameters important for previous tasks.
Reference: Kirkpatrick et al. (2017) "Overcoming catastrophic forgetting in NNs"
"""
def __init__(self, model: nn.Module, dataset: Iterator, lambda_ewc: float = 0.4):
self.model = model
self.lambda_ewc = lambda_ewc
# Save original weights
self._means: Dict[str, torch.Tensor] = {
n: p.data.clone()
for n, p in model.named_parameters()
if p.requires_grad
}
# Compute Fisher Information Matrix (diagonal)
self._fisher = self._compute_fisher(dataset)
def _compute_fisher(self, dataset: Iterator) -> Dict[str, torch.Tensor]:
"""
Estimate diagonal FIM as mean of squared gradients.
Higher value = more important parameter.
"""
fisher = {n: torch.zeros_like(p) for n, p in self.model.named_parameters() if p.requires_grad}
self.model.eval()
n_samples = 0
for batch in dataset:
self.model.zero_grad()
outputs = self.model(**batch)
outputs.loss.backward()
for n, p in self.model.named_parameters():
if p.grad is not None and n in fisher:
fisher[n] += p.grad.detach() ** 2
n_samples += 1
for n in fisher:
fisher[n] /= n_samples
return fisher
def penalty(self, model: nn.Module) -> torch.Tensor:
"""Compute the EWC penalty to add to the task loss."""
penalty = torch.tensor(0.0, device=next(model.parameters()).device)
for n, p in model.named_parameters():
if n in self._fisher and n in self._means:
penalty += (self._fisher[n] * (p - self._means[n]) ** 2).sum()
return self.lambda_ewc * penalty
# Usage in custom training loop:
# ewc = EWC(model, old_task_dataloader, lambda_ewc=0.4)
# loss = task_loss + ewc.penalty(model)
# =========================================
# L2 Regularization toward pretrained weights (simpler alternative)
# =========================================
def l2_penalty_to_pretrained(model: nn.Module, original_params: dict, lambda_l2: float = 0.01) -> torch.Tensor:
"""Penalizes L2 distance from original weights. Simpler than EWC but less accurate."""
penalty = torch.tensor(0.0)
for n, p in model.named_parameters():
if n in original_params:
penalty += ((p - original_params[n]) ** 2).sum()
return lambda_l2 * penalty
# =========================================
# Practical tips for catastrophic forgetting prevention
# =========================================
tips = {
"Low learning rate": "Use lr=1e-5 or lower; BERT is sensitive to large updates",
"Warmup": "Always use warmup_ratio=0.06-0.1; prevents instability in early steps",
"Gradual unfreezing": "Start with only the last 2 layers, progressively unlock earlier layers",
"Early stopping": "Stop training as soon as validation metric stops improving",
"LoRA": "By design, LoRA does not modify original weights, so no catastrophic forgetting",
}
for k, v in tips.items():
print(f" {k}: {v}")
7. Post Fine-tuning Evaluation
Robust evaluation of a fine-tuned model requires more than simple aggregate metrics. It is essential to analyze errors by class, identify failure patterns, and test on out-of-distribution examples.
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import pandas as pd
import torch
def comprehensive_evaluation(
model,
tokenizer,
test_texts: list,
test_labels: list,
label_names: list,
batch_size: int = 32,
device: str = "cuda"
) -> dict:
"""
Complete evaluation: aggregate metrics, per-class metrics,
error analysis, calibration, uncertain examples.
"""
model.eval()
all_logits, all_labels_list = [], []
for i in range(0, len(test_texts), batch_size):
batch_texts = test_texts[i:i+batch_size]
batch_labels = test_labels[i:i+batch_size]
inputs = tokenizer(
batch_texts, return_tensors='pt',
truncation=True, padding=True, max_length=256
).to(device)
with torch.no_grad():
outputs = model(**inputs)
all_logits.append(outputs.logits.cpu().numpy())
all_labels_list.extend(batch_labels)
all_logits = np.vstack(all_logits)
    # Numerically stable softmax: subtract the row max before exponentiating
    shifted = all_logits - all_logits.max(axis=1, keepdims=True)
    all_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
all_preds = np.argmax(all_logits, axis=1)
all_labels_arr = np.array(all_labels_list)
# 1. Detailed classification report
print("=" * 60)
print("CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(all_labels_arr, all_preds, target_names=label_names, digits=4))
# 2. Confidence metrics
max_probs = all_probs.max(axis=1)
correct_mask = (all_preds == all_labels_arr)
print(f"\nAvg confidence (correct): {max_probs[correct_mask].mean():.4f}")
print(f"Avg confidence (wrong): {max_probs[~correct_mask].mean():.4f}")
# 3. Uncertain predictions (high entropy)
entropies = -np.sum(all_probs * np.log(all_probs + 1e-10), axis=1)
uncertain_mask = entropies > np.percentile(entropies, 80)
print(f"\nUncertain examples ({uncertain_mask.sum()}/{len(all_labels_arr)}):")
print(f" Accuracy on uncertain: {correct_mask[uncertain_mask].mean():.4f}")
print(f" Accuracy on certain: {correct_mask[~uncertain_mask].mean():.4f}")
# 4. High-confidence errors (model is wrong but very confident — dangerous!)
error_df = pd.DataFrame({
"text": test_texts,
"true_label": [label_names[l] for l in all_labels_arr],
"pred_label": [label_names[p] for p in all_preds],
"confidence": max_probs,
"correct": correct_mask
})
print("\n=== HIGH-CONFIDENCE ERRORS (confidence > 0.9 but wrong) ===")
high_conf_errors = error_df[(~error_df["correct"]) & (error_df["confidence"] > 0.9)]
if len(high_conf_errors) > 0:
print(high_conf_errors[["text", "true_label", "pred_label", "confidence"]].head(5).to_string())
else:
print("No high-confidence errors found!")
return {"predictions": all_preds, "probabilities": all_probs}
8. Deployment and Model Versioning
Once fine-tuning is complete, structured deployment management is essential. LoRA fine-tuned models can be deployed in two modes: adapter-only (lightweight, requires the base model) or merged (standalone, larger).
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from peft import PeftModel
import json
import os
from pathlib import Path
from datetime import datetime
class ModelDeploymentManager:
"""
Manages deployment of LoRA fine-tuned models.
Supports: version saving, merging, metadata tracking.
"""
def __init__(self, output_dir: str):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
def save_version(
self,
base_model_name: str,
adapter_path: str,
metadata: dict,
merge: bool = True
) -> str:
"""Save a model version with complete metadata."""
version = datetime.now().strftime("%Y%m%d_%H%M%S")
version_dir = self.output_dir / f"v_{version}"
version_dir.mkdir()
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)
# Save adapter only (~1-5MB)
adapter_dir = version_dir / "adapter"
peft_model.save_pretrained(str(adapter_dir))
tokenizer.save_pretrained(str(adapter_dir))
merged_dir = None
if merge:
# Merge and save full model (for fast inference)
merged_dir = version_dir / "merged"
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained(str(merged_dir))
tokenizer.save_pretrained(str(merged_dir))
deploy_metadata = {
"version": version,
"base_model": base_model_name,
"created_at": datetime.now().isoformat(),
"adapter_path": str(adapter_dir),
"merged_path": str(merged_dir) if merged_dir else None,
"adapter_size_mb": sum(
os.path.getsize(f) for f in adapter_dir.rglob("*") if f.is_file()
) / 1e6,
**metadata
}
with open(version_dir / "metadata.json", "w") as f:
json.dump(deploy_metadata, f, indent=2)
print(f"Version {version} saved:")
print(f" Adapter: {deploy_metadata['adapter_size_mb']:.1f}MB")
return str(version_dir)
def load_for_inference(self, version_dir: str, use_merged: bool = True):
"""Load model for production inference."""
version_path = Path(version_dir)
with open(version_path / "metadata.json") as f:
meta = json.load(f)
if use_merged and meta.get("merged_path"):
model = AutoModelForSequenceClassification.from_pretrained(meta["merged_path"])
tok = AutoTokenizer.from_pretrained(meta["merged_path"])
else:
base = AutoModelForSequenceClassification.from_pretrained(meta["base_model"])
model = PeftModel.from_pretrained(base, meta["adapter_path"])
tok = AutoTokenizer.from_pretrained(meta["adapter_path"])
        import torch  # local import: only needed to pick the inference device
        device = 0 if torch.cuda.is_available() else -1
        return pipeline("text-classification", model=model, tokenizer=tok,
                        device=device), meta
print("ModelDeploymentManager configured!")
9. Efficient Inference After Fine-tuning
After fine-tuning, inference optimization is critical for production deployment. A fine-tuned BERT model can be quantized, exported to ONNX, or served through a dedicated inference server (e.g., HuggingFace Text Generation Inference for generative models) for maximum throughput.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
import torch
import numpy as np
import time
# =========================================
# Post-LoRA inference optimization options
# =========================================
def benchmark_inference_modes(
base_model_name: str,
adapter_path: str,
test_texts: list,
num_runs: int = 5
):
"""
Benchmark three deployment modes for a LoRA fine-tuned model:
1. Adapter mode: base + LoRA adapter (small, slower)
2. Merged mode: merged weights (faster, larger)
3. INT8 quantized: ~4x smaller, ~2x faster on CPU
"""
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# ---- Mode 1: Adapter (keeps LoRA separate) ----
base = AutoModelForSequenceClassification.from_pretrained(base_model_name)
model_adapter = PeftModel.from_pretrained(base, adapter_path)
model_adapter.eval()
# ---- Mode 2: Merged weights ----
base2 = AutoModelForSequenceClassification.from_pretrained(base_model_name)
peft_model = PeftModel.from_pretrained(base2, adapter_path)
model_merged = peft_model.merge_and_unload()
model_merged.eval()
# ---- Mode 3: Dynamic INT8 quantization ----
model_int8 = torch.quantization.quantize_dynamic(
model_merged,
{torch.nn.Linear},
dtype=torch.qint8
)
def run_inference(model, texts, n_runs):
times = []
for _ in range(n_runs):
start = time.perf_counter()
inputs = tokenizer(
texts, return_tensors='pt',
truncation=True, padding=True, max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
times.append((time.perf_counter() - start) * 1000)
return np.mean(times), np.std(times)
print("=== Inference Benchmark ===")
for name, model in [
("Adapter", model_adapter),
("Merged", model_merged),
("INT8 Quantized", model_int8)
]:
avg, std = run_inference(model, test_texts, num_runs)
param_count = sum(p.numel() for p in model.parameters()) / 1e6
print(f"\n{name}:")
print(f" Avg latency: {avg:.1f}ms ± {std:.1f}ms")
print(f" Parameters: {param_count:.1f}M")
# =========================================
# Serving with HuggingFace Inference Endpoints
# =========================================
# Option A: HuggingFace Inference API (cloud)
# Push your model to the Hub and create a dedicated endpoint
# Option B: Text Generation Inference (TGI) for generative models
# docker run -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest
# --model-id microsoft/Phi-3-mini-4k-instruct
# Option C: Local FastAPI serving (lightweight)
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
class ClassificationRequest(BaseModel):
texts: list
batch_size: int = 32
class ClassificationResponse(BaseModel):
predictions: list
latency_ms: float
# app = FastAPI()
# clf_model = None  # loaded at startup

# @app.on_event("startup")
# async def load_model():
#     global clf_model
#     # ProductionClassifier is a placeholder for any wrapper exposing .predict(texts)
#     clf_model = ProductionClassifier("./models/fine-tuned")
#     print("Model loaded successfully!")

# @app.post("/classify", response_model=ClassificationResponse)
# async def classify(request: ClassificationRequest):
#     start = time.perf_counter()
#     predictions = clf_model.predict(request.texts)
#     latency = (time.perf_counter() - start) * 1000
#     return ClassificationResponse(predictions=predictions, latency_ms=latency)
print("FastAPI serving template ready!")
print("Start with: uvicorn serving:app --host 0.0.0.0 --port 8000 --workers 4")
9.1 Continual Fine-tuning and Dataset Versioning
In production, models must be retrained periodically as new labeled data becomes available. Proper versioning of datasets and models ensures reproducibility and enables rollback to previous versions when quality degrades.
import json
import hashlib
from pathlib import Path
from datetime import datetime
from datasets import Dataset, DatasetDict, load_from_disk

class DatasetVersionManager:
    """
    Manages versioned NLP datasets for continual fine-tuning.
    - Deduplication: prevents training on duplicate texts
    - Schema validation: enforces required fields
    - Versioned snapshots: enables reproducibility and rollback
    """
    def __init__(self, storage_dir: str = "./dataset_versions"):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        self._index = self._load_index()

    def _load_index(self) -> dict:
        index_path = self.storage_dir / "index.json"
        if index_path.exists():
            with open(index_path) as f:
                return json.load(f)
        return {"versions": []}

    def _save_index(self):
        with open(self.storage_dir / "index.json", "w") as f:
            json.dump(self._index, f, indent=2)

    def _text_hash(self, text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def add_version(
        self,
        dataset_dict: DatasetDict,
        metadata: dict,
        deduplicate: bool = True
    ) -> str:
        """Save a new dataset version with optional deduplication."""
        version_id = datetime.now().strftime("%Y%m%d_%H%M%S")
        version_dir = self.storage_dir / version_id
        if deduplicate:
            # Remove duplicates based on text hash
            for split_name, split in dataset_dict.items():
                seen_hashes = set()
                unique_indices = []
                for i, example in enumerate(split):
                    h = self._text_hash(example.get("text", example.get("sentence", "")))
                    if h not in seen_hashes:
                        seen_hashes.add(h)
                        unique_indices.append(i)
                if len(unique_indices) < len(split):
                    removed = len(split) - len(unique_indices)
                    print(f" {split_name}: removed {removed} duplicates")
                    dataset_dict[split_name] = split.select(unique_indices)
        # Save dataset
        dataset_dict.save_to_disk(str(version_dir))
        # Update index
        version_entry = {
            "version_id": version_id,
            "created_at": datetime.now().isoformat(),
            "splits": {k: len(v) for k, v in dataset_dict.items()},
            **metadata
        }
        self._index["versions"].append(version_entry)
        self._save_index()
        print(f"Dataset version {version_id} saved:")
        for split, size in version_entry["splits"].items():
            print(f" {split}: {size} examples")
        return version_id

    def load_version(self, version_id: str) -> DatasetDict:
        """Load a specific dataset version for training."""
        version_dir = self.storage_dir / version_id
        if not version_dir.exists():
            raise ValueError(f"Version {version_id} not found")
        return load_from_disk(str(version_dir))

    def list_versions(self) -> list:
        """List all available dataset versions."""
        return self._index["versions"]

# Usage
manager = DatasetVersionManager("./dataset_versions")

# Example: add initial dataset version
v1_data = {
    "text": [
        "The product is excellent, highly recommended!",
        "Terrible quality, broke after one day.",
        "Average product, nothing special.",
    ],
    "label": [2, 0, 1]  # 0=negative, 1=neutral, 2=positive
}
v1_dataset = DatasetDict({
    "train": Dataset.from_dict({k: v[:2] for k, v in v1_data.items()}),
    "test": Dataset.from_dict({k: v[2:] for k, v in v1_data.items()})
})
v1_id = manager.add_version(
    v1_dataset,
    metadata={"description": "Initial dataset - product reviews", "labeler": "team_a"}
)
print(f"\nAvailable versions: {len(manager.list_versions())}")
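The index written by DatasetVersionManager can also drive the rollback decision mentioned above. A minimal sketch, assuming each index entry carries an eval_f1 metric recorded after training (a field you would add via metadata, not one the class stores by itself): pick the most recent version whose recorded quality clears a threshold.

```python
def pick_rollback_version(versions, metric="eval_f1", threshold=0.85):
    """Return the most recent version_id whose recorded metric
    meets the threshold, or None if no version qualifies."""
    # version_id is a timestamp string, so lexical sort == chronological sort
    for entry in sorted(versions, key=lambda v: v["version_id"], reverse=True):
        if entry.get(metric, 0.0) >= threshold:
            return entry["version_id"]
    return None

# Illustrative index entries (metric values are made up)
history = [
    {"version_id": "20240101_090000", "eval_f1": 0.88},
    {"version_id": "20240201_090000", "eval_f1": 0.91},
    {"version_id": "20240301_090000", "eval_f1": 0.79},  # quality degraded
]
print(pick_rollback_version(history))  # -> 20240201_090000
```

The newest version is skipped because its metric fell below the threshold, so the manager would reload the 20240201 snapshot via load_version for retraining.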
Common Fine-tuning Anti-Patterns
- Using the same LR as pre-training: BERT pre-trains at 1e-4; for fine-tuning use 2e-5 (5x lower) to prevent overfitting
- No warmup: without warmup, training is unstable in the first iterations; always use warmup_ratio=0.06-0.1
- Too few epochs on small datasets: with 100 examples and 3 epochs the model won't converge; use 10-20 epochs with early stopping
- Evaluating only on generic benchmarks: a BERT achieving 93% on SST-2 might get only 60% on your specific domain
- Not monitoring validation loss: training loss always decreases; monitor validation loss to detect overfitting
- Saving best model without metadata: without knowing lr, epochs, dataset, and metrics you can't replicate the training
- Not checking data distribution: undetected class imbalance leads to models predicting the majority class constantly
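The last anti-pattern is cheap to guard against before any training starts. A minimal sketch with the standard library (the 3.0 ratio threshold is an illustrative assumption; tune it to your task):

```python
from collections import Counter

def check_label_balance(labels, max_ratio=3.0):
    """Report the label distribution and flag imbalance when the
    majority/minority count ratio exceeds max_ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "imbalance_ratio": round(ratio, 2),
        "imbalanced": ratio > max_ratio,
    }

# 0=negative, 1=neutral, 2=positive -- heavily skewed toward positive
labels = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]
print(check_label_balance(labels))
```

If the check flags an imbalance, consider class weights in the loss, oversampling the minority classes, or stratified splits before fine-tuning.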
Conclusions and Next Steps
Domain-specific fine-tuning is the key to transforming generic models into highly effective tools for real applications. With LoRA and QLoRA, this is now accessible even with consumer hardware, democratizing access to enterprise-quality models.
The strategy choice depends on context: DAPT for linguistic adaptation, LoRA for the optimal quality/cost balance, QLoRA for large LLMs, SetFit for very few data points. In all cases, rigorous evaluation on the target domain is indispensable.
Key Takeaways
- Start with DAPT if you have many unannotated domain texts (5-15% improvement)
- LoRA (r=16) offers the best quality/cost trade-off for BERT-size models
- QLoRA enables fine-tuning 7B+ LLMs on 8GB GPUs, reducing VRAM by 65%
- With few data (<500), use SetFit or layer freezing + layer-wise LR
- Gradual unfreezing is the most effective technique for small datasets
- EWC is useful for continual learning (maintaining performance across multiple tasks)
- Always evaluate on a domain-specific test set, not just generic benchmarks
- Implement a ModelDeploymentManager to track versions and metadata
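Tracking metadata alongside each checkpoint, the lightest-weight piece of that last takeaway, can be as simple as a JSON sidecar next to the saved model. A stdlib-only sketch (field names and the ./models/fine-tuned path are illustrative):

```python
import json
import hashlib
from pathlib import Path
from datetime import datetime

def save_run_metadata(model_dir, hparams, metrics, dataset_version):
    """Write a training_metadata.json sidecar next to the saved model
    so every checkpoint stays reproducible and auditable."""
    meta = {
        "saved_at": datetime.now().isoformat(),
        "dataset_version": dataset_version,  # ties the model to its data
        "hparams": hparams,
        "metrics": metrics,
    }
    # Short hash of the hyperparameters for quick run comparison
    meta["config_hash"] = hashlib.sha256(
        json.dumps(hparams, sort_keys=True).encode()
    ).hexdigest()[:12]
    path = Path(model_dir) / "training_metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2))
    return path

meta_path = save_run_metadata(
    "./models/fine-tuned",
    hparams={"learning_rate": 2e-5, "epochs": 10, "warmup_ratio": 0.06},
    metrics={"eval_f1": 0.91},
    dataset_version="20240201_090000",
)
```

Pairing this sidecar with the DatasetVersionManager index gives you the full lineage (data version, hyperparameters, metrics) needed to replicate or roll back any run.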
Continue the Modern NLP Series
- Previous: HuggingFace Transformers: Complete Guide — ecosystem and Trainer API
- Next: Semantic Similarity and Text Matching — SBERT, FAISS, dense retrieval
- Article 10: NLP Monitoring in Production — drift detection and automatic retraining
- Related series: AI Engineering/RAG — fine-tuned models as RAG components
- Related series: Deep Learning Advanced — quantization and advanced optimization
- Related series: MLOps — versioning and serving NLP models in production