Fine-tuning NLP Models Locally: Adapting BERT to Your Domain
Pre-trained models like BERT are extremely powerful, but they are trained on generic data. For real applications — legal contract analysis, medical record classification, sentiment on domain-specific reviews, NER on technical texts — domain-specific fine-tuning makes the difference between a mediocre model and an excellent one.
In this article we explore the main techniques for adapting BERT (and LLMs) to your domain: from domain-adaptive pre-training to LoRA fine-tuning on consumer GPUs, and from managing annotated data to strategies for maximizing quality with few examples. We include practical code examples and real-world use cases.
This is the eighth article in the Modern NLP: from BERT to LLMs series, classified as Advanced. It assumes familiarity with BERT and the HuggingFace ecosystem (articles 2 and 7).
What You Will Learn
- Fine-tuning strategies: from scratch, partial, full, adapter — systematic comparison
- Domain-Adaptive Pre-training (DAPT) for domain adaptation
- LoRA mathematics: low-rank decomposition and geometric intuition
- Practical LoRA: implementation with PEFT library for classification
- QLoRA: LoRA with 4-bit quantization on consumer GPU (8-16GB)
- Fine-tuning LLMs (LLaMA, Mistral) with TRL and SFTTrainer
- Managing small datasets (<1000 examples): techniques to maximize performance
- NLP data augmentation: back-translation, EDA, synonym replacement
- Techniques to avoid catastrophic forgetting (EWC, gradual unfreezing)
- Post fine-tuning evaluation: domain-specific benchmarks and error analysis
- Model versioning and deployment management
1. Fine-tuning Strategies: A Comparison
There is no single optimal fine-tuning strategy. The choice depends on computational resources, available data quantity, base model size, and performance requirements. The table below provides a practical decision framework.
Fine-tuning Approaches: Decision Guide
| Strategy | Trained Parameters | GPU Required | Data Needed | Pros | Cons |
|---|---|---|---|---|---|
| Full fine-tuning | 100% (all) | 16-80GB | 10K+ | Maximum accuracy, highest adaptability | Expensive, catastrophic forgetting risk, high storage |
| Partial (last N layers) | 10-30% | 8-16GB | 1K+ | Faster, less catastrophic forgetting | Less flexible, suboptimal on large distribution shifts |
| LoRA (r=8-32) | 0.1-1% | 8-16GB | 100+ | Best trade-off, small adapter, no catastrophic forgetting | Slight overhead at runtime if not merged |
| QLoRA (4-bit) | 0.1-1% | 6-12GB | 100+ | Large LLMs on consumer GPU, minimal costs | Slightly slower, requires bitsandbytes |
| Adapter layers | 1-5% | 8-16GB | 500+ | Multi-task with one base model, modular | Extra latency, more complex architecture |
| Prompt tuning | <0.1% | 8GB | 500+ | Minimal storage, no weight modification | Lower performance on small datasets |
| SetFit (sentence-transformers) | 100% SBERT | 4-8GB | 8-64 (few-shot!) | Excellent with very few data points | Classification only, no generation |
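As a quick sanity check, the table can be condensed into a toy helper. This is purely illustrative: the function name and thresholds below are our own simplification of the table, not any library API.

```python
def suggest_strategy(gpu_gb: float, num_examples: int, need_generation: bool = False) -> str:
    """Toy heuristic encoding a few rows of the decision table above."""
    if not need_generation and num_examples < 100:
        return "SetFit"          # few-shot classification, 8-64 examples
    if gpu_gb < 8:
        return "QLoRA (4-bit)"   # large models on 6-12GB GPUs
    if num_examples >= 10_000 and gpu_gb >= 16:
        return "Full fine-tuning"
    return "LoRA (r=8-32)"       # best trade-off in most cases

print(suggest_strategy(gpu_gb=8, num_examples=500))    # LoRA (r=8-32)
print(suggest_strategy(gpu_gb=6, num_examples=2000))   # QLoRA (4-bit)
print(suggest_strategy(gpu_gb=8, num_examples=30))     # SetFit
```

In practice the boundaries are soft; treat the helper as a mnemonic for the table, not a rule.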
2. Domain-Adaptive Pre-training (DAPT)
Before task-specific fine-tuning, it is often useful to continue pre-training the model on target-domain text (no labels required) with the masked language modeling (MLM) objective. This helps the model acquire the vocabulary and patterns of the specific domain. Research (Gururangan et al., 2020, "Don't Stop Pretraining") shows that DAPT yields consistent gains, often in the 5-15% range on technical domains.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForWholeWordMask,
    TrainingArguments,
    Trainer
)
from datasets import Dataset
import torch
# Base model to adapt
BASE_MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)
# Domain corpus (e.g., medical/legal texts — no labels needed)
domain_texts = [
"The patient presents with symptoms of congestive heart failure...",
"Pursuant to the applicable regulations, the contracting parties agree...",
"The differential diagnosis includes neoplastic and inflammatory pathologies...",
"Histological examination reveals the presence of atypical cells at...",
# ... thousands of domain texts
]
def tokenize_corpus(examples, chunk_size=512):
"""Tokenize and split into chunks of max 512 tokens."""
tokenized = tokenizer(
examples["text"],
truncation=False,
return_special_tokens_mask=True
)
all_input_ids, all_attention_masks, all_special_tokens_masks = [], [], []
for ids, attn, stm in zip(
tokenized["input_ids"],
tokenized["attention_mask"],
tokenized["special_tokens_mask"]
):
for i in range(0, len(ids), chunk_size):
chunk = ids[i:i+chunk_size]
if len(chunk) >= 64:
padded = chunk + [tokenizer.pad_token_id] * (chunk_size - len(chunk))
attn_chunk = [1] * len(chunk) + [0] * (chunk_size - len(chunk))
stm_chunk = stm[i:i+chunk_size] + [1] * (chunk_size - len(chunk))
all_input_ids.append(padded)
all_attention_masks.append(attn_chunk)
all_special_tokens_masks.append(stm_chunk)
return {
"input_ids": all_input_ids,
"attention_mask": all_attention_masks,
"special_tokens_mask": all_special_tokens_masks
}
domain_dataset = Dataset.from_dict({"text": domain_texts})
tokenized_corpus = domain_dataset.map(tokenize_corpus, batched=True, remove_columns=["text"])
# Whole Word Masking collator (more effective than standard token masking)
data_collator_wwm = DataCollatorForWholeWordMask(
tokenizer=tokenizer,
mlm=True,
mlm_probability=0.15
)
# DAPT training configuration
dapt_args = TrainingArguments(
output_dir="./models/bert-domain-dapt",
num_train_epochs=5,
per_device_train_batch_size=16,
learning_rate=5e-5, # higher LR for DAPT vs fine-tuning
warmup_ratio=0.05,
weight_decay=0.01,
save_steps=500,
save_total_limit=2,
fp16=True,
report_to="none",
logging_steps=100,
)
dapt_trainer = Trainer(
model=model,
args=dapt_args,
train_dataset=tokenized_corpus,
data_collator=data_collator_wwm
)
print("Starting DAPT training...")
dapt_trainer.train()
model.save_pretrained("./models/bert-domain-dapt")
tokenizer.save_pretrained("./models/bert-domain-dapt")
print("DAPT complete. The model has acquired domain-specific vocabulary.")
3. LoRA: Mathematics and Implementation
LoRA (Low-Rank Adaptation) is based on the observation that during fine-tuning, the weight updates of pre-trained models have a low intrinsic rank. Instead of modifying W ∈ R^(d x k) directly, LoRA parameterizes the update as ΔW = BA, where B ∈ R^(d x r) and A ∈ R^(r x k) with r ≪ min(d, k). The effective update is scaled by α/r, a hyperparameter ratio controlling how strongly the adapter influences the frozen weights.
With r=8 applied to the query and value projections, BERT-base drops from 110M trainable parameters to approximately 300K (0.27%). With r=16 this rises to ~600K (0.54%), usually with better performance. The trade-off: a higher rank means more parameters and more memory, and typically (though not always) better quality.
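The decomposition is easy to verify numerically. A minimal numpy sketch follows; the dimensions mirror a single BERT-base attention projection, and the values are random placeholders, not real weights.

```python
import numpy as np

d, k, r = 768, 768, 8            # BERT-base projection size, LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))  # frozen pretrained weight

# B starts at zero and A random, so delta_W = B @ A is zero before training
B = np.zeros((d, r))
A = rng.standard_normal((r, k))
alpha = 16                       # effective update scaled by alpha / r

delta_W = (alpha / r) * (B @ A)
W_adapted = W + delta_W          # what the forward pass effectively uses

full_params = d * k              # 589,824 if W were trained directly
lora_params = r * (d + k)        # 12,288 trainable LoRA parameters
print(f"per-matrix reduction: {lora_params:,} / {full_params:,} "
      f"= {lora_params / full_params:.2%}")
# 12 layers x 2 matrices (query, value) x 12,288 ≈ 295K -> the ~300K above
```

Because B is zero-initialized, the adapted model is exactly the pretrained model at step 0; training only ever moves it through a rank-r subspace of weight updates.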
How to Choose the LoRA Rank r
| Rank r | Trainable Parameters | Extra Memory | When to Use |
|---|---|---|---|
| r=4 | ~0.1% | Minimal | Simple tasks, much data, ultra-light deployment |
| r=8 | ~0.25% | Low | Good default for most tasks |
| r=16 | ~0.5% | Medium | Complex tasks, recommended best practice |
| r=32 | ~1% | Medium-high | Very complex tasks, large distribution shifts |
| r=64 | ~2% | High | Near-equivalent to full fine-tuning in some cases |
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import evaluate
import numpy as np
MODEL = "./models/bert-domain-dapt" # or "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL,
num_labels=4,
id2label={0: "employment", 1: "sale", 2: "lease", 3: "service"},
label2id={"employment": 0, "sale": 1, "lease": 2, "service": 3}
)
# Optimized LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
r=16, # optimal rank for classification tasks
lora_alpha=32, # scaling = lora_alpha / r = 2.0
target_modules=[ # layers to modify in BERT
"query", # query projection in multi-head attention
"key", # key projection
"value", # value projection
"dense" # dense layer in attention output and FFN
],
lora_dropout=0.05,
bias="none",
modules_to_save=["classifier"] # classification head always fully trained
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 592,898 || all params: 124,647,170 || trainable%: 0.4756%
# Verify trainable layers
print("\nTrainable layers:")
for name, param in peft_model.named_parameters():
if param.requires_grad:
print(f" {name}: {param.shape}")
# Training dataset
train_data = {
"text": [
"The employer agrees to pay the employee a monthly salary of...",
"The parties agree to the sale of the property located at...",
"The landlord grants the tenant a lease of the apartment for...",
"The consultant shall provide IT advisory services for a period of...",
],
"label": [0, 1, 2, 3]
}
def tokenize_fn(examples):
return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)
train_ds = Dataset.from_dict(train_data).map(tokenize_fn, batched=True, remove_columns=["text"])
train_ds.set_format("torch")
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
return {
"accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
"f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
}
# LoRA training: higher LR and more epochs vs full fine-tuning
args = TrainingArguments(
output_dir="./results/bert-legal-lora",
num_train_epochs=20, # more epochs for small datasets
per_device_train_batch_size=8,
learning_rate=3e-4, # LoRA uses higher LR (3e-4 vs 2e-5)
warmup_ratio=0.1,
weight_decay=0.01,
    eval_strategy="no",  # set to "epoch" and pass eval_dataset once a validation split exists
save_strategy="epoch",
save_total_limit=2,
fp16=True,
report_to="none",
seed=42
)
trainer = Trainer(
model=peft_model,
args=args,
train_dataset=train_ds,
compute_metrics=compute_metrics
)
trainer.train()
peft_model.save_pretrained("./models/bert-lora-adapter")
print("\nLoRA adapter saved (~2MB instead of ~500MB!)")
4. QLoRA: Fine-tuning LLMs on Consumer GPU
QLoRA (Dettmers et al., 2023) combines 4-bit quantization with LoRA, enabling fine-tuning of very large models (7B-70B parameters) on consumer GPUs with 6-24GB of VRAM. The paper's Guanaco-65B model, a LLaMA-65B fine-tuned with QLoRA, reaches roughly 99% of ChatGPT's performance on the Vicuna benchmark.
VRAM Requirements for QLoRA on Common Models
| Model | Parameters | FP16 | INT8 | NF4 (QLoRA) | Minimum GPU |
|---|---|---|---|---|---|
| Mistral-7B | 7B | ~14GB | ~8GB | ~5GB | RTX 3070 (8GB) |
| Llama-2-13B | 13B | ~26GB | ~14GB | ~9GB | RTX 3090 (24GB)* |
| Llama-2-70B | 70B | ~140GB | ~70GB | ~40GB | A100 80GB or 2x A40 |
| BERT-base | 110M | ~0.4GB | ~0.2GB | ~0.1GB | CPU or any GPU |
*With gradient checkpointing and batch size 1
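The table's FP16 column follows directly from a back-of-envelope rule: weight memory ≈ parameters × bits per parameter / 8. The sketch below covers weights only; activations, gradients, and CUDA overhead come on top, which is why the NF4 column above sits somewhat higher than the raw estimate.

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight-only VRAM estimate: params * bits / 8 bytes, in GB."""
    return params_billion * bits_per_param / 8

for name, b in [("Mistral-7B", 7), ("Llama-2-13B", 13), ("Llama-2-70B", 70)]:
    fp16 = estimate_weight_vram_gb(b, 16)
    nf4 = estimate_weight_vram_gb(b, 4.5)  # ~4 bits + quantization constants
    print(f"{name}: FP16 ≈ {fp16:.0f}GB, NF4 ≈ {nf4:.1f}GB (+ runtime overhead)")
```

The 4.5 bits/parameter figure is an approximation for NF4 with double quantization; the exact footprint depends on block size and which layers stay unquantized.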
# pip install bitsandbytes accelerate peft trl transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
import torch
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 (optimal for LLMs)
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # saves ~0.4 bits per parameter
)
# Load model in 4-bit: Mistral-7B from ~14GB to ~5GB VRAM!
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
attn_implementation="flash_attention_2" # Flash Attention 2 if available
)
print(f"GPU memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# LoRA config for LLM (attention + MLP layers)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", # attention layers
"o_proj", # output projection
"gate_proj", "up_proj", "down_proj" # MLP SwiGLU layers
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: ~42M for r=16 on these modules (~0.6% of the 7B base;
# the exact printout varies with the PEFT version and how 4-bit weights are counted)
# Dataset in instruction-following format
def format_instruction(instruction: str, input_text: str, output: str) -> str:
if input_text:
return (
f"### Instruction:\n{instruction}\n\n"
f"### Input:\n{input_text}\n\n"
f"### Response:\n{output}"
)
return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
train_examples = [
{
"text": format_instruction(
instruction="Classify this text into the appropriate category.",
input_text="The employer agrees to pay the employee a monthly salary...",
output="employment"
)
},
{
"text": format_instruction(
instruction="Extract the contracting parties from this document.",
input_text="Between John Smith (seller) and Jane Doe (buyer), it is agreed that...",
output="Seller: John Smith\nBuyer: Jane Doe"
)
},
]
train_dataset = Dataset.from_list(train_examples)
# SFTTrainer for supervised fine-tuning
sft_config = SFTConfig(
output_dir="./models/mistral-qlora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # effective batch = 4*4 = 16
warmup_ratio=0.1,
learning_rate=2e-4,
bf16=True, # bfloat16 more stable than fp16 for LLMs
logging_steps=10,
    optim="paged_adamw_32bit",  # paged optimizer states: avoids OOM spikes during updates
lr_scheduler_type="cosine",
max_seq_length=512,
dataset_text_field="text",
packing=True, # pack short examples for efficiency
report_to="none",
)
trainer = SFTTrainer(model=peft_model, train_dataset=train_dataset, args=sft_config)
trainer.train()
trainer.save_model("./models/mistral-qlora")
print("QLoRA fine-tuning complete!")
5. Managing Small Datasets
In many real-world scenarios, annotated data is scarce. Here are the most effective strategies to maximize quality with few examples.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import random
# =========================================
# Strategy 1: SetFit for few-shot learning (2-64 examples!)
# =========================================
from setfit import SetFitModel, SetFitTrainer
setfit_model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# SetFit works with as few as 8 examples per class; only 2 per class shown here
train_data = {
"text": ["Great product, highly recommend!",
"Poor quality, not worth the price.",
"Absolutely amazing, exceeded expectations!",
"Terrible experience, waste of money."],
"label": [1, 0, 1, 0]
}
from datasets import Dataset
setfit_trainer = SetFitTrainer(
model=setfit_model,
train_dataset=Dataset.from_dict(train_data),
num_iterations=20,
num_epochs=1,
batch_size=16,
)
setfit_trainer.train()
# =========================================
# Strategy 2: Gradual unfreezing
# =========================================
def progressive_unfreeze(model, epoch: int, total_epochs: int, num_layers: int = 12):
"""
Gradual unfreezing: unlock layers from last to first as training progresses.
Prevents catastrophic forgetting and improves performance with little data.
"""
layers_to_unfreeze = max(1, int(num_layers * epoch / total_epochs))
first_layer_to_unfreeze = num_layers - layers_to_unfreeze
for i, layer in enumerate(model.bert.encoder.layer):
frozen = (i < first_layer_to_unfreeze)
for param in layer.parameters():
param.requires_grad = not frozen
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f" Epoch {epoch}: unlocked layers {first_layer_to_unfreeze}-{num_layers-1}, "
f"trainable: {trainable:,}")
# =========================================
# Strategy 3: Layer-wise learning rates
# =========================================
from torch.optim import AdamW
def get_layerwise_lr(model, base_lr: float = 2e-5, lr_decay: float = 0.75) -> list:
"""
Decreasing learning rate for lower layers.
Lower layers (syntax, basic semantics) change little;
higher layers (task-specific features) change more.
"""
params = [{"params": model.bert.embeddings.parameters(), "lr": base_lr * (lr_decay ** 13)}]
for i, layer in enumerate(model.bert.encoder.layer):
lr = base_lr * (lr_decay ** (12 - i))
params.append({"params": layer.parameters(), "lr": lr})
params.append({"params": model.bert.pooler.parameters(), "lr": base_lr})
params.append({"params": model.classifier.parameters(), "lr": base_lr * 10}) # 10x for head
return params
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
optimizer = AdamW(get_layerwise_lr(model, base_lr=2e-5, lr_decay=0.75))
# =========================================
# Strategy 4: Data Augmentation
# =========================================
def easy_data_augmentation(text: str, num_aug: int = 4, p_delete: float = 0.1) -> list:
    """Easy Data Augmentation (EDA): random swap + random deletion.
    Full EDA also includes synonym replacement and random insertion,
    which need a thesaurus (e.g., WordNet)."""
    words = text.split()
    augmented = []
    for _ in range(num_aug):
        new_words = words.copy()
        # Random Swap: exchange two random positions
        if len(new_words) >= 2:
            i, j = random.sample(range(len(new_words)), 2)
            new_words[i], new_words[j] = new_words[j], new_words[i]
        # Random Deletion: drop each word with probability p_delete
        kept = [w for w in new_words if random.random() > p_delete]
        augmented.append(" ".join(kept if kept else new_words))
    return augmented
# Back-translation for stronger semantic-preserving augmentation
from transformers import pipeline as hf_pipeline
def back_translate(text: str, src_to_pivot, pivot_to_src) -> str:
"""Back-translation: src -> pivot language -> src (semantically similar variant)."""
pivoted = src_to_pivot(text, max_length=512)[0]['translation_text']
back = pivot_to_src(pivoted, max_length=512)[0]['translation_text']
return back
# Example setup (requires translation models):
# en_to_de = hf_pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
# de_to_en = hf_pipeline("translation_de_to_en", model="Helsinki-NLP/opus-mt-de-en")
# augmented = back_translate("Great product!", en_to_de, de_to_en)
print("Data augmentation strategies configured!")
6. Avoiding Catastrophic Forgetting
A common risk in fine-tuning is catastrophic forgetting: the model "forgets" general knowledge acquired during pre-training while learning the specific task. Here is how to mitigate it with Elastic Weight Consolidation and other techniques.
import torch
from torch import nn
from typing import Dict, Iterator
class EWC:
"""
Elastic Weight Consolidation to prevent catastrophic forgetting.
Penalizes large changes to parameters important for previous tasks.
Reference: Kirkpatrick et al. (2017) "Overcoming catastrophic forgetting in NNs"
"""
def __init__(self, model: nn.Module, dataset: Iterator, lambda_ewc: float = 0.4):
self.model = model
self.lambda_ewc = lambda_ewc
# Save original weights
self._means: Dict[str, torch.Tensor] = {
n: p.data.clone()
for n, p in model.named_parameters()
if p.requires_grad
}
# Compute Fisher Information Matrix (diagonal)
self._fisher = self._compute_fisher(dataset)
def _compute_fisher(self, dataset: Iterator) -> Dict[str, torch.Tensor]:
"""
Estimate diagonal FIM as mean of squared gradients.
Higher value = more important parameter.
"""
fisher = {n: torch.zeros_like(p) for n, p in self.model.named_parameters() if p.requires_grad}
self.model.eval()
n_samples = 0
for batch in dataset:
self.model.zero_grad()
outputs = self.model(**batch)
outputs.loss.backward()
for n, p in self.model.named_parameters():
if p.grad is not None and n in fisher:
fisher[n] += p.grad.detach() ** 2
n_samples += 1
for n in fisher:
fisher[n] /= n_samples
return fisher
def penalty(self, model: nn.Module) -> torch.Tensor:
"""Compute the EWC penalty to add to the task loss."""
penalty = torch.tensor(0.0, device=next(model.parameters()).device)
for n, p in model.named_parameters():
if n in self._fisher and n in self._means:
penalty += (self._fisher[n] * (p - self._means[n]) ** 2).sum()
return self.lambda_ewc * penalty
# Usage in custom training loop:
# ewc = EWC(model, old_task_dataloader, lambda_ewc=0.4)
# loss = task_loss + ewc.penalty(model)
# =========================================
# L2 Regularization toward pretrained weights (simpler alternative)
# =========================================
def l2_penalty_to_pretrained(model: nn.Module, original_params: dict, lambda_l2: float = 0.01) -> torch.Tensor:
"""Penalizes L2 distance from original weights. Simpler than EWC but less accurate."""
penalty = torch.tensor(0.0)
for n, p in model.named_parameters():
if n in original_params:
penalty += ((p - original_params[n]) ** 2).sum()
return lambda_l2 * penalty
# =========================================
# Practical tips for catastrophic forgetting prevention
# =========================================
tips = {
"Low learning rate": "Use lr=1e-5 or lower; BERT is sensitive to large updates",
"Warmup": "Always use warmup_ratio=0.06-0.1; prevents instability in early steps",
"Gradual unfreezing": "Start with only the last 2 layers, progressively unlock earlier layers",
"Early stopping": "Stop training as soon as validation metric stops improving",
"LoRA": "By design, LoRA does not modify original weights, so no catastrophic forgetting",
}
for k, v in tips.items():
print(f" {k}: {v}")
7. Post Fine-tuning Evaluation
Robust evaluation of a fine-tuned model requires more than simple aggregate metrics. It is essential to analyze errors by class, identify failure patterns, and test on out-of-distribution examples.
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import pandas as pd
import torch
def comprehensive_evaluation(
model,
tokenizer,
test_texts: list,
test_labels: list,
label_names: list,
batch_size: int = 32,
device: str = "cuda"
) -> dict:
"""
Complete evaluation: aggregate metrics, per-class metrics,
error analysis, calibration, uncertain examples.
"""
model.eval()
all_logits, all_labels_list = [], []
for i in range(0, len(test_texts), batch_size):
batch_texts = test_texts[i:i+batch_size]
batch_labels = test_labels[i:i+batch_size]
inputs = tokenizer(
batch_texts, return_tensors='pt',
truncation=True, padding=True, max_length=256
).to(device)
with torch.no_grad():
outputs = model(**inputs)
all_logits.append(outputs.logits.cpu().numpy())
all_labels_list.extend(batch_labels)
all_logits = np.vstack(all_logits)
    # Numerically stable softmax: subtract the row max before exponentiating
    shifted = all_logits - all_logits.max(axis=1, keepdims=True)
    all_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
all_preds = np.argmax(all_logits, axis=1)
all_labels_arr = np.array(all_labels_list)
# 1. Detailed classification report
print("=" * 60)
print("CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(all_labels_arr, all_preds, target_names=label_names, digits=4))
# 2. Confidence metrics
max_probs = all_probs.max(axis=1)
correct_mask = (all_preds == all_labels_arr)
print(f"\nAvg confidence (correct): {max_probs[correct_mask].mean():.4f}")
print(f"Avg confidence (wrong): {max_probs[~correct_mask].mean():.4f}")
# 3. Uncertain predictions (high entropy)
entropies = -np.sum(all_probs * np.log(all_probs + 1e-10), axis=1)
uncertain_mask = entropies > np.percentile(entropies, 80)
print(f"\nUncertain examples ({uncertain_mask.sum()}/{len(all_labels_arr)}):")
print(f" Accuracy on uncertain: {correct_mask[uncertain_mask].mean():.4f}")
print(f" Accuracy on certain: {correct_mask[~uncertain_mask].mean():.4f}")
# 4. High-confidence errors (model is wrong but very confident — dangerous!)
error_df = pd.DataFrame({
"text": test_texts,
"true_label": [label_names[l] for l in all_labels_arr],
"pred_label": [label_names[p] for p in all_preds],
"confidence": max_probs,
"correct": correct_mask
})
print("\n=== HIGH-CONFIDENCE ERRORS (confidence > 0.9 but wrong) ===")
high_conf_errors = error_df[(~error_df["correct"]) & (error_df["confidence"] > 0.9)]
if len(high_conf_errors) > 0:
print(high_conf_errors[["text", "true_label", "pred_label", "confidence"]].head(5).to_string())
else:
print("No high-confidence errors found!")
return {"predictions": all_preds, "probabilities": all_probs}
8. Deployment and Model Versioning
Once fine-tuning is complete, structured deployment management is essential. LoRA fine-tuned models can be deployed in two modes: adapter-only (lightweight, requires the base model) or merged (standalone, larger).
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from peft import PeftModel
import json
import os
from pathlib import Path
from datetime import datetime
class ModelDeploymentManager:
"""
Manages deployment of LoRA fine-tuned models.
Supports: version saving, merging, metadata tracking.
"""
def __init__(self, output_dir: str):
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
def save_version(
self,
base_model_name: str,
adapter_path: str,
metadata: dict,
merge: bool = True
) -> str:
"""Save a model version with complete metadata."""
version = datetime.now().strftime("%Y%m%d_%H%M%S")
version_dir = self.output_dir / f"v_{version}"
version_dir.mkdir()
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)
# Save adapter only (~1-5MB)
adapter_dir = version_dir / "adapter"
peft_model.save_pretrained(str(adapter_dir))
tokenizer.save_pretrained(str(adapter_dir))
merged_dir = None
if merge:
# Merge and save full model (for fast inference)
merged_dir = version_dir / "merged"
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained(str(merged_dir))
tokenizer.save_pretrained(str(merged_dir))
deploy_metadata = {
"version": version,
"base_model": base_model_name,
"created_at": datetime.now().isoformat(),
"adapter_path": str(adapter_dir),
"merged_path": str(merged_dir) if merged_dir else None,
"adapter_size_mb": sum(
os.path.getsize(f) for f in adapter_dir.rglob("*") if f.is_file()
) / 1e6,
**metadata
}
with open(version_dir / "metadata.json", "w") as f:
json.dump(deploy_metadata, f, indent=2)
print(f"Version {version} saved:")
print(f" Adapter: {deploy_metadata['adapter_size_mb']:.1f}MB")
return str(version_dir)
def load_for_inference(self, version_dir: str, use_merged: bool = True):
"""Load model for production inference."""
version_path = Path(version_dir)
with open(version_path / "metadata.json") as f:
meta = json.load(f)
if use_merged and meta.get("merged_path"):
model = AutoModelForSequenceClassification.from_pretrained(meta["merged_path"])
tok = AutoTokenizer.from_pretrained(meta["merged_path"])
else:
base = AutoModelForSequenceClassification.from_pretrained(meta["base_model"])
model = PeftModel.from_pretrained(base, meta["adapter_path"])
tok = AutoTokenizer.from_pretrained(meta["adapter_path"])
        import torch  # local import: only needed to pick the inference device
        device = 0 if torch.cuda.is_available() else -1
        return pipeline("text-classification", model=model, tokenizer=tok,
                        device=device), meta
print("ModelDeploymentManager configured!")
9. Efficient Inference After Fine-tuning
After fine-tuning, inference optimization is critical for production deployment. A fine-tuned BERT model can be quantized, exported to ONNX, or served through a dedicated inference server (e.g., HuggingFace Text Generation Inference for generative models) for maximum throughput.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
import torch
import numpy as np
import time
# =========================================
# Post-LoRA inference optimization options
# =========================================
def benchmark_inference_modes(
base_model_name: str,
adapter_path: str,
test_texts: list,
num_runs: int = 5
):
"""
Benchmark three deployment modes for a LoRA fine-tuned model:
1. Adapter mode: base + LoRA adapter (small, slower)
2. Merged mode: merged weights (faster, larger)
3. INT8 quantized: ~4x smaller, ~2x faster on CPU
"""
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# ---- Mode 1: Adapter (keeps LoRA separate) ----
base = AutoModelForSequenceClassification.from_pretrained(base_model_name)
model_adapter = PeftModel.from_pretrained(base, adapter_path)
model_adapter.eval()
# ---- Mode 2: Merged weights ----
base2 = AutoModelForSequenceClassification.from_pretrained(base_model_name)
peft_model = PeftModel.from_pretrained(base2, adapter_path)
model_merged = peft_model.merge_and_unload()
model_merged.eval()
# ---- Mode 3: Dynamic INT8 quantization ----
model_int8 = torch.quantization.quantize_dynamic(
model_merged,
{torch.nn.Linear},
dtype=torch.qint8
)
def run_inference(model, texts, n_runs):
times = []
for _ in range(n_runs):
start = time.perf_counter()
inputs = tokenizer(
texts, return_tensors='pt',
truncation=True, padding=True, max_length=128
)
with torch.no_grad():
outputs = model(**inputs)
times.append((time.perf_counter() - start) * 1000)
return np.mean(times), np.std(times)
print("=== Inference Benchmark ===")
for name, model in [
("Adapter", model_adapter),
("Merged", model_merged),
("INT8 Quantized", model_int8)
]:
avg, std = run_inference(model, test_texts, num_runs)
param_count = sum(p.numel() for p in model.parameters()) / 1e6
print(f"\n{name}:")
print(f" Avg latency: {avg:.1f}ms ± {std:.1f}ms")
print(f" Parameters: {param_count:.1f}M")
# =========================================
# Serving with HuggingFace Inference Endpoints
# =========================================
# Option A: HuggingFace Inference API (cloud)
# Push your model to the Hub and create a dedicated endpoint
# Option B: Text Generation Inference (TGI) for generative models
# docker run -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest
# --model-id microsoft/Phi-3-mini-4k-instruct
# Option C: Local FastAPI serving (lightweight)
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
class ClassificationRequest(BaseModel):
texts: list
batch_size: int = 32
class ClassificationResponse(BaseModel):
predictions: list
latency_ms: float
# app = FastAPI()
# clf_model = None  # loaded at startup

# @app.on_event("startup")
# async def load_model():
#     global clf_model
#     # ProductionClassifier is a placeholder for any wrapper exposing .predict(texts)
#     clf_model = ProductionClassifier("./models/fine-tuned")
#     print("Model loaded successfully!")

# @app.post("/classify", response_model=ClassificationResponse)
# async def classify(request: ClassificationRequest):
#     start = time.perf_counter()
#     predictions = clf_model.predict(request.texts)
#     latency = (time.perf_counter() - start) * 1000
#     return ClassificationResponse(predictions=predictions, latency_ms=latency)
print("FastAPI serving template ready!")
print("Start with: uvicorn serving:app --host 0.0.0.0 --port 8000 --workers 4")
9.1 Continual Fine-tuning and Dataset Versioning
In production, models must be retrained periodically as new labeled data becomes available. Proper versioning of datasets and models ensures reproducibility and enables rollback to previous versions when quality degrades.
import json
import hashlib
from pathlib import Path
from datetime import datetime
from datasets import Dataset, DatasetDict, load_from_disk

class DatasetVersionManager:
    """
    Manages versioned NLP datasets for continual fine-tuning.
    - Deduplication: prevents training on duplicate texts
    - Schema validation: enforces required fields
    - Versioned snapshots: enables reproducibility and rollback
    """
    def __init__(self, storage_dir: str = "./dataset_versions"):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        self._index = self._load_index()

    def _load_index(self) -> dict:
        index_path = self.storage_dir / "index.json"
        if index_path.exists():
            with open(index_path) as f:
                return json.load(f)
        return {"versions": []}

    def _save_index(self):
        with open(self.storage_dir / "index.json", "w") as f:
            json.dump(self._index, f, indent=2)

    def _text_hash(self, text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:16]

    def add_version(
        self,
        dataset_dict: DatasetDict,
        metadata: dict,
        deduplicate: bool = True
    ) -> str:
        """Save a new dataset version with optional deduplication."""
        version_id = datetime.now().strftime("%Y%m%d_%H%M%S")
        version_dir = self.storage_dir / version_id
        if deduplicate:
            # Remove duplicates based on text hash
            for split_name, split in dataset_dict.items():
                seen_hashes = set()
                unique_indices = []
                for i, example in enumerate(split):
                    h = self._text_hash(example.get("text", example.get("sentence", "")))
                    if h not in seen_hashes:
                        seen_hashes.add(h)
                        unique_indices.append(i)
                if len(unique_indices) < len(split):
                    removed = len(split) - len(unique_indices)
                    print(f" {split_name}: removed {removed} duplicates")
                    dataset_dict[split_name] = split.select(unique_indices)
        # Save dataset
        dataset_dict.save_to_disk(str(version_dir))
        # Update index
        version_entry = {
            "version_id": version_id,
            "created_at": datetime.now().isoformat(),
            "splits": {k: len(v) for k, v in dataset_dict.items()},
            **metadata
        }
        self._index["versions"].append(version_entry)
        self._save_index()
        print(f"Dataset version {version_id} saved:")
        for split, size in version_entry["splits"].items():
            print(f" {split}: {size} examples")
        return version_id

    def load_version(self, version_id: str) -> DatasetDict:
        """Load a specific dataset version for training."""
        version_dir = self.storage_dir / version_id
        if not version_dir.exists():
            raise ValueError(f"Version {version_id} not found")
        return load_from_disk(str(version_dir))

    def list_versions(self) -> list:
        """List all available dataset versions."""
        return self._index["versions"]

# Usage
manager = DatasetVersionManager("./dataset_versions")

# Example: add initial dataset version
v1_data = {
    "text": [
        "The product is excellent, highly recommended!",
        "Terrible quality, broke after one day.",
        "Average product, nothing special.",
    ],
    "label": [2, 0, 1]  # 0=negative, 1=neutral, 2=positive
}
v1_dataset = DatasetDict({
    "train": Dataset.from_dict({k: v[:2] for k, v in v1_data.items()}),
    "test": Dataset.from_dict({k: v[2:] for k, v in v1_data.items()})
})
v1_id = manager.add_version(
    v1_dataset,
    metadata={"description": "Initial dataset - product reviews", "labeler": "team_a"}
)
print(f"\nAvailable versions: {len(manager.list_versions())}")
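The index written by DatasetVersionManager can also drive the rollback decision mentioned above. A minimal sketch, assuming each index entry carries an eval_f1 metric recorded after training (a field you would add via metadata, not one the class stores by itself): pick the most recent version whose recorded quality clears a threshold.

```python
def pick_rollback_version(versions, metric="eval_f1", threshold=0.85):
    """Return the most recent version_id whose recorded metric
    meets the threshold, or None if no version qualifies."""
    # version_id is a timestamp string, so lexical sort == chronological sort
    for entry in sorted(versions, key=lambda v: v["version_id"], reverse=True):
        if entry.get(metric, 0.0) >= threshold:
            return entry["version_id"]
    return None

# Illustrative index entries (metric values are made up)
history = [
    {"version_id": "20240101_090000", "eval_f1": 0.88},
    {"version_id": "20240201_090000", "eval_f1": 0.91},
    {"version_id": "20240301_090000", "eval_f1": 0.79},  # quality degraded
]
print(pick_rollback_version(history))  # -> 20240201_090000
```

The newest version is skipped because its metric fell below the threshold, so the manager would reload the 20240201 snapshot via load_version for retraining.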
Common Fine-tuning Anti-Patterns
- Using the same LR as pre-training: BERT pre-trains at 1e-4; for fine-tuning use 2e-5 (5x lower) to prevent overfitting
- No warmup: without warmup, training is unstable in the first iterations; always use warmup_ratio=0.06-0.1
- Too few epochs on small datasets: with 100 examples and 3 epochs the model won't converge; use 10-20 epochs with early stopping
- Evaluating only on generic benchmarks: a BERT achieving 93% on SST-2 might get only 60% on your specific domain
- Not monitoring validation loss: training loss always decreases; monitor validation loss to detect overfitting
- Saving best model without metadata: without knowing lr, epochs, dataset, and metrics you can't replicate the training
- Not checking data distribution: undetected class imbalance leads to models predicting the majority class constantly
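The last anti-pattern is cheap to guard against before any training starts. A minimal sketch with the standard library (the 3.0 ratio threshold is an illustrative assumption; tune it to your task):

```python
from collections import Counter

def check_label_balance(labels, max_ratio=3.0):
    """Report the label distribution and flag imbalance when the
    majority/minority count ratio exceeds max_ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "imbalance_ratio": round(ratio, 2),
        "imbalanced": ratio > max_ratio,
    }

# 0=negative, 1=neutral, 2=positive -- heavily skewed toward positive
labels = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1]
print(check_label_balance(labels))
```

If the check flags an imbalance, consider class weights in the loss, oversampling the minority classes, or stratified splits before fine-tuning.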
Conclusions and Next Steps
Domain-specific fine-tuning is the key to transforming generic models into highly effective tools for real applications. With LoRA and QLoRA, this is now accessible even with consumer hardware, democratizing access to enterprise-quality models.
The strategy choice depends on context: DAPT for linguistic adaptation, LoRA for the optimal quality/cost balance, QLoRA for large LLMs, SetFit for very few data points. In all cases, rigorous evaluation on the target domain is indispensable.
Key Takeaways
- Start with DAPT if you have many unannotated domain texts (5-15% improvement)
- LoRA (r=16) offers the best quality/cost trade-off for BERT-size models
- QLoRA enables fine-tuning 7B+ LLMs on 8GB GPUs, reducing VRAM by 65%
- With few data (<500), use SetFit or layer freezing + layer-wise LR
- Gradual unfreezing is the most effective technique for small datasets
- EWC is useful for continual learning (maintaining performance across multiple tasks)
- Always evaluate on a domain-specific test set, not just generic benchmarks
- Implement a ModelDeploymentManager to track versions and metadata
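Tracking metadata alongside each checkpoint, the lightest-weight piece of that last takeaway, can be as simple as a JSON sidecar next to the saved model. A stdlib-only sketch (field names and the ./models/fine-tuned path are illustrative):

```python
import json
import hashlib
from pathlib import Path
from datetime import datetime

def save_run_metadata(model_dir, hparams, metrics, dataset_version):
    """Write a training_metadata.json sidecar next to the saved model
    so every checkpoint stays reproducible and auditable."""
    meta = {
        "saved_at": datetime.now().isoformat(),
        "dataset_version": dataset_version,  # ties the model to its data
        "hparams": hparams,
        "metrics": metrics,
    }
    # Short hash of the hyperparameters for quick run comparison
    meta["config_hash"] = hashlib.sha256(
        json.dumps(hparams, sort_keys=True).encode()
    ).hexdigest()[:12]
    path = Path(model_dir) / "training_metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2))
    return path

meta_path = save_run_metadata(
    "./models/fine-tuned",
    hparams={"learning_rate": 2e-5, "epochs": 10, "warmup_ratio": 0.06},
    metrics={"eval_f1": 0.91},
    dataset_version="20240201_090000",
)
```

Pairing this sidecar with the DatasetVersionManager index gives you the full lineage (data version, hyperparameters, metrics) needed to replicate or roll back any run.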
Continue the Modern NLP Series
- Previous: HuggingFace Transformers: Complete Guide — ecosystem and Trainer API
- Next: Semantic Similarity and Text Matching — SBERT, FAISS, dense retrieval
- Article 10: NLP Monitoring in Production — drift detection and automatic retraining
- Related series: AI Engineering/RAG — fine-tuned models as RAG components
- Related series: Deep Learning Advanced — quantization and advanced optimization
- Related series: MLOps — versioning and serving NLP models in production