02 - Fine-Tuning Transformers: LoRA, QLoRA, DoRA
Pre-trained Transformer models such as Llama 3, Mistral, GPT-4 and Claude possess impressive knowledge of language and reasoning, yet they are rarely ready out of the box for a specific task. To adapt them to our domain, whether it is classifying corporate emails, generating code in a proprietary framework or answering medical questions, we need fine-tuning.
The problem is that these models have billions of parameters: LLaMA 2 has 7 billion in its smallest version, 70 billion in its largest. A full fine-tuning run on LLaMA-7B requires roughly 28 GB of VRAM just for the weights in FP32, plus twice as much (56 GB) for the Adam optimizer states (first and second moments), plus memory for gradients and activations. In practice you need 4x A100 80 GB GPUs for a single fine-tuning run.
In this second article of the Advanced Deep Learning and Edge Deployment series, we will explore Parameter-Efficient Fine-Tuning (PEFT) techniques that allow you to adapt models with billions of parameters using a fraction of the resources. We will start from the problem with full fine-tuning, then analyze LoRA, QLoRA, DoRA and Adapter Layers in detail, with complete Python implementations using HuggingFace PEFT.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | Attention Mechanism in Transformers | Self-attention, multi-head, full architecture |
| 2 | You are here - Fine-Tuning with LoRA, QLoRA and Adapters | Parameter-efficient fine-tuning |
| 3 | Model Quantization | INT8, INT4, GPTQ, AWQ |
| 4 | Pruning and Compression | Parameter reduction, distillation |
| 5 | Knowledge Distillation | Teacher-student, knowledge transfer |
| 6 | Ollama and Local LLMs | Local inference, optimization |
| 7 | Vision Transformers | ViT, DINO, image classification |
| 8 | Edge Deployment | ONNX, TensorRT, mobile devices |
| 9 | NAS and AutoML | Neural Architecture Search |
| 10 | Benchmarks and Optimization | Profiling, metrics, tuning |
What You Will Learn
- Why full fine-tuning is unsustainable for models with billions of parameters
- A complete overview of PEFT techniques: prefix tuning, prompt tuning, adapters, LoRA
- The mathematics of LoRA: low-rank decomposition, alpha scaling, target modules
- QLoRA: how to combine 4-bit quantization with LoRA for fine-tuning on consumer GPUs
- DoRA: the weight decomposition into magnitude and direction that improves on LoRA
- Adapter Layers: bottleneck adapters, parallel adapters, adapter fusion
- Complete practical implementation with HuggingFace PEFT and SFTTrainer
- Dataset preparation: Alpaca format, chat template, conversational format
- Hyperparameter tuning: rank, alpha, learning rate, batch size
- Comparison of LoRA vs QLoRA vs DoRA vs full fine-tuning with real benchmarks
- Hardware requirements and consumer GPU limitations
1. The Problem with Full Fine-Tuning
Full fine-tuning consists of updating all parameters of a pre-trained model during training on a specific dataset. Every weight, from embedding layers to attention heads, is modified through backpropagation. This approach worked well for moderately sized models like BERT (110M parameters), but becomes unsustainable when scaling to billions of parameters.
1.1 Computational Costs
To understand why full fine-tuning is prohibitive, let us analyze the memory requirements for LLaMA 2 7B:
Full Fine-Tuning Memory Footprint (LLaMA 2 7B)
| Component | Formula | Memory |
|---|---|---|
| Model weights (FP16) | 7B * 2 bytes | 14 GB |
| Gradients (FP16) | 7B * 2 bytes | 14 GB |
| Optimizer states (Adam FP32) | 7B * 4 bytes * 2 (m, v) | 56 GB |
| Activations (batch_size=1) | Variable | ~8-16 GB |
| Total | | ~92-100 GB |
No consumer GPU can handle 100 GB of VRAM. Even the A100 80 GB is insufficient without techniques like gradient checkpointing and mixed precision. For LLaMA 70B, we are talking about over 1 TB of required memory.
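The table's arithmetic can be reproduced with a back-of-the-envelope helper (illustrative, not from any library; activation memory varies with batch size and sequence length, so it is passed in as an estimate):

```python
def full_ft_memory_gb(n_params: float, act_gb: float = 8.0) -> dict:
    """Rough full fine-tuning memory estimate in GB (FP16 weights/grads, FP32 Adam)."""
    GB = 1e9
    weights = n_params * 2 / GB      # FP16: 2 bytes per parameter
    grads = n_params * 2 / GB        # FP16 gradients, same size as weights
    adam = n_params * 4 * 2 / GB     # FP32 first and second moments: 8 bytes/param
    total = weights + grads + adam + act_gb
    return {"weights": weights, "grads": grads, "adam": adam, "total": total}

est = full_ft_memory_gb(7e9)
print(est)  # weights 14.0, grads 14.0, adam 56.0, total 92.0 GB for a 7B model
```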
1.2 Catastrophic Forgetting
The second problem with full fine-tuning is catastrophic forgetting. When we update all parameters on a specific dataset, the model tends to forget the knowledge acquired during pre-training. A model fine-tuned on legal documents might lose its ability to generate Python code or answer history questions.
This happens because full fine-tuning indiscriminately modifies all weights, including those that encode general language knowledge. In practice, the model overwrites its previous knowledge with the new one.
1.3 Storage for Multiple Models
If an organization needs to adapt the same base model to 10 different tasks (email classification, sentiment analysis, medical Q&A, report generation, etc.), full fine-tuning requires saving 10 complete copies of the model: for LLaMA 7B, that is 140 GB of storage. With LoRA, as we will see, each adaptation requires only 10-50 MB, for a total of 100-500 MB.
Why Full Fine-Tuning Does Not Scale
| Problem | Full Fine-Tuning | PEFT (LoRA) |
|---|---|---|
| VRAM for LLaMA 7B | ~100 GB | ~16-24 GB |
| Trainable parameters | 7,000,000,000 | ~4,000,000 (0.06%) |
| Storage per adaptation | 14 GB | 10-50 MB |
| Catastrophic forgetting | High risk | Minimal risk |
| Hardware required | 4x A100 80GB | 1x RTX 3090 24GB |
2. Parameter-Efficient Fine-Tuning (PEFT): Overview
PEFT techniques share a fundamental principle: instead of updating all model parameters, we freeze the pre-trained weights and train only a small subset of additional parameters. This drastically reduces memory requirements, speeds up training and preserves the base model's knowledge.
PEFT Techniques Taxonomy
| Technique | Where It Acts | Extra Parameters | Advantages |
|---|---|---|---|
| Prompt Tuning | Input embeddings | Soft tokens prepended to input | Very simple, few parameters |
| Prefix Tuning | Every attention layer | Trainable K,V prefixes | More expressive than prompt tuning |
| Adapter Layers | Between Transformer layers | Inserted bottleneck modules | Flexible, composable |
| LoRA | Existing weight matrices | Low-rank matrices B and A | Zero inference overhead |
| QLoRA | Like LoRA + quantization | LoRA on 4-bit model | Fine-tuning on consumer GPUs |
| DoRA | Like LoRA + decomposition | Magnitude + direction | Higher quality than LoRA |
2.1 Prompt Tuning
Prompt tuning (Lester et al., 2021) adds a set of trainable soft tokens at the beginning of the input. These tokens do not correspond to real words in the vocabulary but are continuous vectors that the model learns during training. The base model remains completely frozen: only the soft token embeddings are trained.
Original input: [Classify this email: "Meeting at 3pm"]
With prompt tuning: [T1][T2][T3]...[T20] [Classify this email: "Meeting at 3pm"]
^^^^^^^^^^^^^^^^^
20 trainable soft tokens
(20 * d_model parameters = 20 * 4096 = 81,920 parameters)
The base model (7B parameters) is FROZEN.
Only 81,920 parameters are updated.
Prompt tuning works surprisingly well for very large models (> 10B parameters), but loses quality with smaller models. With T5-XXL (11B), prompt tuning achieves performance nearly identical to full fine-tuning on SuperGLUE.
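The mechanism above can be sketched in a few lines of NumPy (illustrative shapes only; the real soft prompt lives inside the model's embedding pipeline): the frozen model simply sees 20 extra "token" embeddings prepended to the real input.

```python
import numpy as np

d_model, n_soft, seq_len = 4096, 20, 12

# Trainable soft prompt: the ONLY parameters updated during training
soft_prompt = np.random.randn(n_soft, d_model).astype(np.float32) * 0.02

# Frozen input embeddings for a 12-token example
input_embeds = np.random.randn(seq_len, d_model).astype(np.float32)

# Prepend the soft tokens before feeding the (frozen) Transformer
model_input = np.concatenate([soft_prompt, input_embeds], axis=0)

print(model_input.shape)  # (32, 4096)
print(soft_prompt.size)   # 81920 trainable parameters, as computed above
```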
2.2 Prefix Tuning
Prefix tuning (Li & Liang, 2021) extends the prompt tuning idea: instead of adding soft tokens only to the input, it adds trainable prefixes to the keys (K) and values (V) of every attention layer. This gives the model more expressive power compared to simple prompt tuning.
Prompt Tuning:
Layer 1 Attention: Q=[input], K=[input], V=[input]
Layer 2 Attention: Q=[input], K=[input], V=[input]
Only soft tokens prepended to the initial input.
Prefix Tuning:
Layer 1 Attention: Q=[input], K=[prefix_1 | input], V=[prefix_1 | input]
Layer 2 Attention: Q=[input], K=[prefix_2 | input], V=[prefix_2 | input]
...
Layer L Attention: Q=[input], K=[prefix_L | input], V=[prefix_L | input]
Trainable prefixes at EVERY attention layer.
Parameters: L * prefix_len * 2 * d_model (2 for K and V)
Example: 32 layers * 20 prefix * 2 * 4096 = 5,242,880 parameters
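The parameter count follows directly from the formula (a small hypothetical helper, not a library function):

```python
def prefix_tuning_params(n_layers: int, prefix_len: int, d_model: int) -> int:
    """Trainable parameters: one K prefix and one V prefix per attention layer."""
    return n_layers * prefix_len * 2 * d_model

# LLaMA-7B-like config: 32 layers, prefix of 20 tokens, d_model = 4096
print(prefix_tuning_params(32, 20, 4096))  # 5242880
```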
3. LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2021) is the most widely used PEFT technique and represents a turning point in fine-tuning large language models. The key idea is elegant in its simplicity: instead of directly updating the Transformer weight matrices, we decompose the update into two low-rank matrices.
3.1 The Intuition: Intrinsic Rank Hypothesis
Research (Aghajanyan et al., 2020) has shown that when we fine-tune on a specific task, the weight updates concentrate in a low-dimensional subspace. In other words, the weight update matrix has an intrinsic rank much lower than its nominal dimension.
Consider a weight matrix W of dimensions d x k (for example, the query projection in the attention of LLaMA 7B: 4096 x 4096). The full update would require a matrix of 4096 x 4096 = 16,777,216 parameters. But if the update has a low intrinsic rank, we can approximate it with a much smaller rank r.
3.2 The Mathematics of LoRA
Given a pre-trained weight matrix W_0 of dimensions d x k, LoRA parametrizes the update as the product of two low-rank matrices:
Fundamental LoRA Formula
W = W_0 + \Delta W = W_0 + B \cdot A
Where:
- W_0 \in \mathbb{R}^{d \times k} is the pre-trained weight matrix (frozen)
- B \in \mathbb{R}^{d \times r} is the up-projection matrix (maps from the rank-r subspace back to dimension d)
- A \in \mathbb{R}^{r \times k} is the down-projection matrix (maps the input into the rank-r subspace)
- r \ll \min(d, k) is the rank (typically 4, 8, 16, 32, 64)
With scaling: h = W_0 x + \frac{\alpha}{r} \cdot B A x
where \alpha is a scaling factor that controls the intensity of the adaptation.
Original matrix W_0 (frozen):
k = 4096
+------------------+
| |
d = | W_0 | 16,777,216 parameters
4096| (FROZEN) | Not updated
| |
+------------------+
LoRA update (trainable):
r = 16 k = 4096
+-----+ +------------------+
| | | |
d = | B | x r=16 | A |
4096| | | |
| | +------------------+
+-----+
65,536 65,536
B (d x r) A (r x k)
= 4096 * 16 = 16 * 4096
= 65,536 = 65,536
Total LoRA: 131,072 parameters (0.78% of 16,777,216)
3.3 Initialization
The initialization of A and B is crucial. LoRA initializes:
- A with Kaiming uniform initialization (a small random Gaussian in the original paper)
- B with all zeros
This guarantees that at the start of training the LoRA contribution is exactly zero (B * A = 0), so the model starts from the behavior of the pre-trained model. Training gradually modifies B and A to adapt the model to the specific task.
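Sections 3.2 and 3.3 can be condensed into a small NumPy sketch (illustrative, not the PEFT implementation; toy dimensions): with B initialized to zero, the adapted layer is exactly the frozen layer at step 0, and the update B A can never exceed rank r.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16          # toy dimensions; real LLaMA uses d = k = 4096

W0 = rng.standard_normal((d, k))        # pre-trained weight, frozen
A = rng.standard_normal((r, k)) * 0.01  # down-projection, random init
B = np.zeros((d, r))                    # up-projection, zero init

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
assert np.allclose(lora_forward(x), W0 @ x)  # B = 0  =>  identical to the base model

# Simulate training having moved B away from zero: the update B @ A has rank <= r
B = rng.standard_normal((d, r)) * 0.01
delta_W = B @ A
print(np.linalg.matrix_rank(delta_W))  # at most 8
```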
3.4 Scaling Factor Alpha
The parameter \alpha (alpha) controls the intensity of the LoRA adaptation. The effective output is:
h = W_0 x + \frac{\alpha}{r} \cdot B A x
The ratio \alpha / r works as a learning rate for LoRA. In practice, \alpha = 2r is often used (for example \alpha = 32 with r = 16), yielding a scaling factor of 2. This allows changing the rank without having to readjust the learning rate.
3.5 Target Modules
LoRA is not applied to all Transformer layers, but only to specific weight matrices. The choice of target modules significantly influences fine-tuning quality:
Target Modules for Common Architectures
| Module | Type | Dimensions (LLaMA 7B) | Impact |
|---|---|---|---|
| q_proj | Attention Query | 4096 x 4096 | High - controls what the model "looks for" |
| k_proj | Attention Key | 4096 x 4096 | High - controls how tokens are indexed |
| v_proj | Attention Value | 4096 x 4096 | High - controls extracted information |
| o_proj | Attention Output | 4096 x 4096 | Medium |
| gate_proj | FFN Gate | 4096 x 11008 | High - controls information flow in FFN |
| up_proj | FFN Up | 4096 x 11008 | Medium-High |
| down_proj | FFN Down | 11008 x 4096 | Medium |
3.6 Exact Calculation of Trainable Parameters
Let us calculate the trainable parameters for LLaMA 2 7B with LoRA applied to q_proj and v_proj across all 32 layers:
LLaMA 2 7B: 32 layers, d_model = 4096
With r = 16, target_modules = ["q_proj", "v_proj"]:
Per layer:
q_proj LoRA: B (4096 x 16) + A (16 x 4096) = 65,536 + 65,536 = 131,072
v_proj LoRA: B (4096 x 16) + A (16 x 4096) = 65,536 + 65,536 = 131,072
Total per layer: 262,144
Total: 32 layers * 262,144 = 8,388,608 parameters (8.4M)
Ratio: 8.4M / 6,738M (total LLaMA 7B) = 0.12% of parameters
With target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]:
Per layer: 4 attention modules * 131,072 + 3 FFN modules * 241,664 (= r * (4096 + 11008)) = 1,249,280
Total: 32 * 1,249,280 = ~40.0M parameters = 0.59%
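The q_proj/v_proj count above can be reproduced with a small helper (hypothetical, for illustration): each adapted matrix of shape d_in x d_out costs r * (d_in + d_out) LoRA parameters.

```python
def lora_params(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """LoRA trainable parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

# LLaMA 2 7B, r = 16, target_modules = ["q_proj", "v_proj"], 32 layers
print(lora_params(16, [(4096, 4096), (4096, 4096)], 32))  # 8388608
```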
3.7 Inference Advantage: Zero Overhead
The key advantage of LoRA over Adapter Layers is that at inference time there is no computational overhead. The LoRA matrices can be merged into the base model weights:
W_{merged} = W_0 + \frac{\alpha}{r} \cdot B \cdot A
After merging, the model has exactly the same size and inference speed as the original model, but with updated weights. This is impossible with adapter layers, which add permanent parameters to the model.
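The zero-overhead claim is easy to verify numerically (NumPy sketch with toy dimensions): after merging, a single matmul reproduces the base-plus-adapter output exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 8

W0 = rng.standard_normal((d, k))
B = rng.standard_normal((d, r)) * 0.1
A = rng.standard_normal((r, k)) * 0.1
x = rng.standard_normal(k)

# Inference with a separate adapter: two extra (cheap) matmuls per layer
h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))

# Inference after merging: one matmul, same shape as the original weight
W_merged = W0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_adapter, h_merged))  # True
```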
4. QLoRA: Quantized LoRA
QLoRA (Dettmers et al., 2023) is an extension of LoRA that combines 4-bit quantization of the base model with LoRA fine-tuning. This makes it possible to fine-tune models with 65 billion parameters on a single 48 GB GPU, a result that was previously impossible.
4.1 The Three Innovations of QLoRA
QLoRA introduces three key techniques:
QLoRA Innovations
- 4-bit NormalFloat (NF4): A new quantized data type optimized for normally distributed weights. Each weight is mapped to one of 16 values (4 bits) chosen to minimize quantization error on a Gaussian distribution. NF4 outperforms INT4 and FP4 because neural network weights approximately follow a normal distribution.
- Double Quantization: The quantization constants (one per block of 64 weights) are themselves quantized to 8-bit. This saves an additional 0.37 bits per parameter, roughly 3 GB for a 65B model.
- Paged Optimizers: Uses NVIDIA GPU paged memory (unified memory) to handle memory spikes during training, automatically swapping optimizer states between GPU and CPU when needed.
4.2 How NF4 Works
NF4 quantization is based on the observation that weights of well-trained neural networks follow an approximately normal distribution with zero mean. NF4 divides the standard normal distribution into 16 equal-probability intervals and assigns each interval the optimal value that minimizes the expected error:
INT4 (evenly spaced):
Levels: [-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
Problem: many levels in the tails (few weights there), few at center (many weights there)
NF4 (normal quantiles):
Levels: [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
Advantage: dense levels where most weights are (near 0)
Result: NF4 produces ~2x lower quantization error than INT4
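A toy round-to-nearest quantizer over the two grids makes the difference concrete (the NF4 levels are the published ones; as a simplifying assumption, weights are taken as already scaled to [-1, 1] by their block absmax, and INT4 is mapped to an evenly spaced grid on that range):

```python
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
INT4_LEVELS = [i / 8 for i in range(-8, 8)]  # evenly spaced: -1.0, -0.875, ..., 0.875

def quantize(w: float, levels: list[float]) -> float:
    """Round-to-nearest over a fixed 16-value codebook (= 4 bits per weight)."""
    return min(levels, key=lambda q: abs(q - w))

# Most weights sit near zero, exactly where NF4 packs its levels densely
w = 0.05
print(quantize(w, NF4_LEVELS))   # 0.0796 (error 0.0296)
print(quantize(w, INT4_LEVELS))  # 0.0    (error 0.05)
```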
4.3 QLoRA Memory Footprint
Memory Comparison: Full FT vs LoRA vs QLoRA (LLaMA 7B)
| Component | Full FT (FP16) | LoRA (FP16) | QLoRA (NF4) |
|---|---|---|---|
| Model weights | 14 GB (FP16) | 14 GB (FP16) | 3.5 GB (NF4) |
| Gradients | 14 GB | ~16 MB | ~16 MB |
| Optimizer states | 56 GB | ~64 MB | ~64 MB |
| Activations | ~12 GB | ~12 GB | ~6 GB |
| Total | ~96 GB | ~26 GB | ~10 GB |
With QLoRA, fine-tuning LLaMA 7B becomes possible on a single RTX 3090 (24 GB), and LLaMA 13B on a single RTX 4090 (24 GB) with gradient checkpointing enabled.
5. DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA (Liu et al., 2024) is an evolution of LoRA that decomposes the weight matrix into two components: magnitude and direction. This decomposition is inspired by the classical Weight Normalization from Salimans & Kingma (2016) and aims to bridge the quality gap between LoRA and full fine-tuning.
5.1 The Intuition Behind DoRA
Analysis of learning patterns shows a fundamental difference between full fine-tuning and LoRA: full fine-tuning modifies both the magnitude and direction of weights independently, while LoRA tends to modify them proportionally, limiting its expressiveness.
5.2 The Mathematics of DoRA
DoRA decomposes the pre-trained weight matrix into:
DoRA Formula
W' = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}
Where:
- m \in \mathbb{R}^{1 \times k} is the magnitude vector (trainable), initialized with the column norms of W_0
- W_0 + BA represents the direction updated via LoRA
- \|\cdot\|_c is the column-wise norm, normalizing each column to unit norm
- B and A are the standard LoRA matrices (trainable)
In this way, DoRA has two independent degrees of freedom:
- The direction is controlled by B and A (exactly as in LoRA)
- The magnitude is controlled by the vector m, which can change independently from the direction
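The decomposition can be sketched in NumPy (illustrative, toy dimensions): with B = 0 and m initialized to the column norms of W_0, the DoRA-adapted weight equals W_0 exactly at step 0, and the magnitude can then move independently of the direction.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 32, 32, 4

W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))                           # LoRA-style zero init
m = np.linalg.norm(W0, axis=0, keepdims=True)  # magnitude init: column norms of W0

def dora_weight(W0, m, B, A):
    # W' = m * (W0 + B A) / ||W0 + B A||_c  (column-wise norm)
    V = W0 + B @ A
    return m * (V / np.linalg.norm(V, axis=0, keepdims=True))

assert np.allclose(dora_weight(W0, m, B, A), W0)  # identity at initialization

# Scaling m changes magnitude without touching direction (impossible in plain LoRA)
W_new = dora_weight(W0, m * 1.5, B, A)
print(np.allclose(W_new, 1.5 * W0))  # True
```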
5.3 DoRA Advantages Over LoRA
DoRA vs LoRA Benchmark (Commonsense Reasoning, LLaMA-7B)
| Task | LoRA (r=32) | DoRA (r=32) | Full FT |
|---|---|---|---|
| BoolQ | 69.8 | 71.8 | 73.2 |
| PIQA | 82.1 | 83.2 | 83.9 |
| WinoGrande | 79.4 | 80.6 | 81.5 |
| HellaSwag | 83.6 | 84.4 | 85.1 |
| Average | 78.7 | 80.0 | 80.9 |
DoRA closes the gap between LoRA and full fine-tuning by approximately 60-70%, with minimal parameter overhead (only the additional m vector).
6. Adapter Layers
Adapter Layers (Houlsby et al., 2019) were among the first PEFT techniques proposed. The idea is simple: insert small trainable modules (adapters) between the existing Transformer layers, while keeping the original weights frozen.
6.1 Bottleneck Adapters
A classic adapter is a bottleneck module built from a down-projection, a non-linearity and an up-projection, wrapped in a residual connection:
- Down-projection: reduces dimension from d_model to d_bottleneck (e.g., 4096 -> 64)
- Non-linearity: typically ReLU or GELU
- Up-projection: restores the original dimension (e.g., 64 -> 4096)
- Residual connection: the adapter output is added to the input
Input x (dimension d_model = 4096)
|
+----> Down-proj: W_down (4096 x 64) --> h (dimension 64)
| |
| Non-linearity (GELU)
| |
| Up-proj: W_up (64 x 4096) <------+
| |
+------- (+) ----+ Residual Connection
|
Output (dimension 4096)
Parameters per adapter: (4096 * 64) + (64 * 4096) = 524,288
With 2 adapters per layer, 32 layers: 2 * 32 * 524,288 = 33.5M parameters
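The bottleneck above in a few lines of NumPy (illustrative sketch: GELU replaced by ReLU for brevity, biases omitted, and the up-projection zero-initialized so the adapter starts as an exact identity):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_bottleneck = 4096, 64

W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.01  # 4096 -> 64
W_up = np.zeros((d_bottleneck, d_model))                      # 64 -> 4096, zero init

def adapter(x):
    h = np.maximum(x @ W_down, 0.0)  # down-projection + ReLU non-linearity
    return x + h @ W_up              # up-projection + residual connection

x = rng.standard_normal((1, d_model))
out = adapter(x)
print(out.shape)                 # (1, 4096)
print(W_down.size + W_up.size)   # 524288 parameters, as computed above
```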
6.2 Comparison: Adapters vs LoRA
Adapters vs LoRA: Key Differences
| Characteristic | Adapter Layers | LoRA |
|---|---|---|
| Inference overhead | Yes (additional layers) | No (merged into weights) |
| Additional latency | ~5-10% increase | 0% |
| Composability | High (Adapter Fusion) | Medium (LoRA addition) |
| Ease of implementation | High | High |
| Library support | AdapterHub, PEFT | PEFT, unsloth, axolotl |
7. Practical Implementation with HuggingFace PEFT
Let us move to practice. In this section we will implement fine-tuning of a Transformer model using LoRA and QLoRA with the HuggingFace PEFT library. We will use Mistral 7B as the base model and adapt it for an instruction-response task.
7.1 Setup and Installation
# Install required libraries
# pip install torch transformers peft trl datasets accelerate bitsandbytes
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Check available GPU
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
7.2 LoRA Configuration
# LoRA configuration for Mistral 7B
lora_config = LoraConfig(
# Rank of the low-rank decomposition
# Typical values: 4 (minimum), 8 (good), 16 (great), 32 (high), 64 (maximum)
r=16,
# Alpha: scaling factor. The effective weight is alpha/r.
# Rule of thumb: alpha = 2 * r
lora_alpha=32,
# Dropout applied to LoRA layers during training
# Helps prevent overfitting, especially with small datasets
lora_dropout=0.05,
# Transformer modules to apply LoRA to
# For Mistral/LLaMA: q_proj, k_proj, v_proj, o_proj (attention)
# gate_proj, up_proj, down_proj (FFN)
target_modules=[
"q_proj", # Query projection - high impact
"k_proj", # Key projection - high impact
"v_proj", # Value projection - high impact
"o_proj", # Output projection - medium impact
"gate_proj", # FFN gate - high impact
"up_proj", # FFN up projection - medium-high impact
"down_proj", # FFN down projection - medium impact
],
# Task type
task_type=TaskType.CAUSAL_LM,
# Bias: "none" (recommended), "all", "lora_only"
bias="none",
)
# Print configuration summary
print(f"Rank: {lora_config.r}")
print(f"Alpha: {lora_config.lora_alpha}")
print(f"Scaling: {lora_config.lora_alpha / lora_config.r}")
print(f"Target modules: {lora_config.target_modules}")
print(f"Dropout: {lora_config.lora_dropout}")
7.3 Fine-Tuning with LoRA (FP16)
model_name = "mistralai/Mistral-7B-v0.3"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Load model in FP16
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto", # Automatic distribution across available GPUs
attn_implementation="flash_attention_2", # Flash Attention for efficiency
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Show trainable parameters
model.print_trainable_parameters()
# Indicative output (Mistral 7B, r=16, 7 target modules):
# trainable params: 41,943,040 || all params: ~7.29B || trainable%: ~0.58%
# Training configuration
training_args = SFTConfig(
output_dir="./results/mistral-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size: 4 * 4 = 16
gradient_checkpointing=True, # Save VRAM by trading compute time
optim="adamw_torch",
learning_rate=2e-4, # Higher LR than full FT
lr_scheduler_type="cosine",
warmup_ratio=0.03, # 3% of steps as warmup
weight_decay=0.001,
max_grad_norm=0.3,
logging_steps=10,
save_strategy="steps",
save_steps=100,
max_seq_length=2048,
fp16=True,
report_to="wandb", # Logging to Weights & Biases
seed=42,
)
7.4 Fine-Tuning with QLoRA (4-bit)
# 4-bit quantization configuration (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Load weights in 4-bit
bnb_4bit_quant_type="nf4", # NormalFloat 4-bit (better than INT4)
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 (more stable than FP16)
bnb_4bit_use_double_quant=True, # Double quantization (saves ~0.4 bits/param)
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",
)
# Prepare model for k-bit training
# This freezes the quantized weights and prepares layers for LoRA
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
)
# Apply LoRA (same configuration as before)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params unchanged vs FP16 LoRA (~42M); the reported 'all params'
# roughly halves because bitsandbytes packs two 4-bit weights per stored parameter
# Note: the base model now takes ~3.8GB instead of ~14GB
# QLoRA training configuration
training_args = SFTConfig(
output_dir="./results/mistral-qlora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
optim="paged_adamw_8bit", # Paged optimizer to handle memory spikes
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
weight_decay=0.001,
max_grad_norm=0.3,
logging_steps=10,
save_strategy="steps",
save_steps=100,
max_seq_length=2048,
bf16=True, # BF16 for compute (more stable)
report_to="wandb",
seed=42,
)
8. Dataset Preparation
Fine-tuning quality depends enormously on dataset quality. In this section we will see how to prepare data in the standard formats used for LLM fine-tuning.
8.1 Alpaca Format
The Alpaca format (Stanford, 2023) is one of the most popular formats for fine-tuning LLMs on instruction-response tasks. Each example has three fields:
def format_alpaca(example):
"""Convert an example to Alpaca format for fine-tuning."""
if example.get("input", ""):
# With additional input
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
# Instruction and response only
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return {"text": text}
# Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
dataset = dataset.map(format_alpaca)
# Example formatted output
print(dataset[0]["text"])
# ### Instruction:
# Give three tips for staying healthy.
#
# ### Response:
# 1. Eat a balanced and nutritious diet...
# 2. Exercise regularly...
# 3. Get enough sleep...
8.2 Chat Template (Mistral/ChatML)
For conversational models, the chat template format is more appropriate. Each model has its own specific template:
def format_chat_template(example, tokenizer):
"""Format the example using the model's chat template."""
messages = [
{"role": "system", "content": "You are a helpful expert assistant."},
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]},
]
# Apply the tokenizer's chat template
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
return {"text": text}
# Resulting format for Mistral (its template folds the system message into the first user turn):
# <s>[INST] You are a helpful expert assistant.
#
# Give three tips for staying healthy. [/INST]
# 1. Eat a balanced diet...</s>
# For ChatML (used by many models):
# <|im_start|>system
# You are a helpful expert assistant.<|im_end|>
# <|im_start|>user
# Give three tips for staying healthy.<|im_end|>
# <|im_start|>assistant
# 1. Eat a balanced diet...<|im_end|>
8.3 Training with SFTTrainer
from trl import SFTTrainer
# Formatted dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
dataset = dataset.map(
lambda x: format_chat_template(x, tokenizer)
)
# Train/eval split
dataset = dataset.train_test_split(test_size=0.05, seed=42)
# Create trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer,
)
# Start training
trainer.train()
# Save LoRA adapter (only ~50MB)
trainer.save_model("./results/mistral-qlora/final")
tokenizer.save_pretrained("./results/mistral-qlora/final")
9. Weight Merging and Deployment
After training, we have a frozen base model and a small LoRA adapter. For deployment, we can keep them separate (useful for switching between different adaptations) or merge them into a single model.
9.1 Merging LoRA Weights into the Base Model
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model (FP16 for merging)
base_model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.3",
torch_dtype=torch.float16,
device_map="auto",
)
# Load LoRA adapter
model = PeftModel.from_pretrained(
base_model,
"./results/mistral-qlora/final",
)
# Merge LoRA into base weights
# After merge: W_merged = W_0 + (alpha/r) * B * A
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./models/mistral-7b-finetuned")
tokenizer.save_pretrained("./models/mistral-7b-finetuned")
# Upload to HuggingFace Hub
model.push_to_hub("username/mistral-7b-finetuned")
tokenizer.push_to_hub("username/mistral-7b-finetuned")
print("Model merged and uploaded to HuggingFace Hub!")
9.2 Inference with the Fine-Tuned Model
from transformers import pipeline
# Generation pipeline
pipe = pipeline(
"text-generation",
model="./models/mistral-7b-finetuned",
torch_dtype=torch.float16,
device_map="auto",
)
# Generate
messages = [
{"role": "system", "content": "You are an expert programming assistant."},
{"role": "user", "content": "Explain the Repository pattern in Python."},
]
output = pipe(
messages,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
print(output[0]["generated_text"][-1]["content"])
10. Hyperparameter Tuning
Choosing the right hyperparameters is crucial for good results. Here is a practical guide based on community experience and published benchmarks.
LoRA Hyperparameter Guide
| Hyperparameter | Range | Recommended | Notes |
|---|---|---|---|
| rank (r) | 4-256 | 16-64 | Increase for complex tasks; r=8 often sufficient for classification |
| alpha | 8-128 | 2 * rank | alpha/r is the effective scaling; keep ratio constant when changing r |
| learning_rate | 1e-5 - 5e-4 | 2e-4 | Higher than full FT (10-100x); reduce if loss oscillates |
| batch_size | 1-32 | 4-8 | Use gradient_accumulation to simulate larger batches |
| epochs | 1-5 | 2-3 | Watch for overfitting; monitor eval loss |
| warmup_ratio | 0.01-0.1 | 0.03 | Important for stability; higher with high LR |
| dropout | 0.0-0.1 | 0.05 | 0.0 for large datasets, 0.1 for small datasets |
| max_seq_length | 512-8192 | 2048 | Larger = more VRAM; adapt to dataset |
Common Fine-Tuning Mistakes
- Learning rate too high: loss oscillates or diverges. Solution: reduce LR by 2-5x or increase warmup
- Rank too low: the model does not learn enough. Solution: increase r from 8 to 16 or 32
- Rank too high: overfitting, especially with small datasets. Solution: reduce r or increase dropout
- Too few epochs: underfitting. Check whether eval loss is still decreasing
- Too many epochs: training loss drops but eval loss rises (overfitting)
- Unclean dataset: duplicates, errors, inconsistent formatting degrade quality
- Forgetting gradient checkpointing: immediate OOM on large models
11. Benchmarks and Comparisons
How do you choose between LoRA, QLoRA, DoRA and full fine-tuning? Here is a systematic comparison based on published benchmarks and community testing.
Complete Comparison: LoRA vs QLoRA vs DoRA vs Full FT
| Metric | Full FT | LoRA | QLoRA | DoRA |
|---|---|---|---|---|
| Quality (MT-Bench) | 7.8 | 7.5 | 7.3 | 7.6 |
| VRAM (7B model) | ~100 GB | ~26 GB | ~10 GB | ~26 GB |
| Training speed (rel.) | 1.0x | 1.2x | 0.8x | 1.1x |
| Trainable parameters | 100% | 0.1-0.5% | 0.1-0.5% | 0.1-0.5% + m |
| Adaptation storage | 14 GB | 10-50 MB | 10-50 MB | 10-50 MB |
| Inference overhead | 0% | 0% (merged) | 0% (merged) | 0% (merged) |
| Minimum GPU (7B) | 4x A100 | 1x A100 40GB | 1x RTX 3090 | 1x A100 40GB |
12. Hardware Requirements
Hardware choice depends on the model you want to fine-tune and the PEFT technique you use. Here is a practical guide for consumer and professional GPUs.
GPU Requirements for Fine-Tuning
| GPU | VRAM | LoRA (FP16) | QLoRA (4-bit) | Notes |
|---|---|---|---|---|
| RTX 3060 | 12 GB | up to 3B | up to 7B (seq 512) | Entry-level, limited |
| RTX 3090 | 24 GB | up to 7B | up to 13B | Great for QLoRA 7B |
| RTX 4090 | 24 GB | up to 7B | up to 13B | Faster than 3090, same VRAM |
| A100 40GB | 40 GB | up to 13B | up to 34B | Professional standard |
| A100 80GB | 80 GB | up to 30B | up to 70B | Ideal for large models |
| H100 80GB | 80 GB | up to 30B | up to 70B | Faster than A100, FP8 support |
Tips for Consumer GPUs
- Limited budget (RTX 3060 12GB): QLoRA on 7B models with seq_length=512, batch_size=1, gradient_accumulation=16
- Good value (RTX 3090/4090 24GB): QLoRA on 7-13B models with seq_length=2048, batch_size=4
- Cloud computing: Google Colab Pro ($10/month) offers A100 40GB for limited sessions; RunPod and Lambda Labs for intensive use
- Always enable: gradient_checkpointing=True, paged optimizers, Flash Attention 2
13. Practical Use Cases
Let us look at some concrete use cases where fine-tuning with LoRA is particularly effective.
13.1 Text Classification
For classification tasks (sentiment analysis, topic classification, spam detection), LoRA is often preferable because:
- The task requires few additional parameters (rank r=4-8 is sufficient)
- Classification datasets are typically small (1k-50k examples)
- The risk of overfitting with high rank is elevated
13.2 Code Generation
For fine-tuning on code generation (e.g., adapting a model to a specific language or corporate conventions), the recommendation is:
- Higher rank (r=32-64) because code has more structure and variability
- Complete target modules (all 7 Transformer modules)
- High-quality dataset: well-documented code, with tests, consistent style
- Long sequences (max_seq_length=4096-8192) for complete context
13.3 Domain-Specific Chatbot
To create a specialized chatbot (medical, legal, technical support):
- Conversational dataset in the model's chat template format
- Include refusal examples ("I cannot answer this question")
- Medium rank (r=16-32)
- Validation with domain experts, not just automatic metrics
13.4 Summarization
For fine-tuning on summarization tasks:
- Dataset with high-quality (document, summary) pairs
- Long sequences for input (up to 8192 tokens)
- Medium-high rank (r=16-32)
- Evaluation with metrics like ROUGE and human evaluation
14. Conclusions and Decision Tree
PEFT techniques have made fine-tuning of large language models accessible to anyone with a consumer GPU. LoRA, QLoRA and DoRA represent the state of the art for parameter-efficient fine-tuning, each with its own strengths.
Decision Tree: When to Use What
| Situation | Recommended Technique | Reason |
|---|---|---|
| GPU with < 16 GB VRAM | QLoRA | Only option for 7B+ models on consumer hardware |
| GPU with 24-40 GB VRAM | LoRA (FP16) | Better quality than QLoRA, faster training |
| Maximum quality required | DoRA | Closest to full FT, minimal overhead on LoRA |
| Multiple adaptations of the same model | LoRA | Small adapters (~50MB), fast switching between tasks |
| Small model (< 1B parameters) | Full Fine-Tuning | For small models, full FT is often feasible and better |
| Simple task (classification) | LoRA (r=4-8) | Low rank is sufficient; avoids overfitting |
| Complex task (code generation) | LoRA/DoRA (r=32-64) | High rank to capture task complexity |
| Skill composition | Adapter Fusion | Combines multiple adaptations in a structured way |
The field of efficient fine-tuning is evolving rapidly. New techniques such as GaLore (Gradient Low-Rank Projection) and ReLoRA (periodically merging and reinitializing low-rank updates to approximate high-rank training) promise to further narrow the gap with full fine-tuning. However, LoRA and QLoRA remain the de facto standard for LLM fine-tuning today, with a mature ecosystem of libraries (HuggingFace PEFT, Unsloth, Axolotl) and an active community.
In the next article of the series we will explore Model Quantization: GPTQ, AWQ, INT8, and how to reduce model size by 75% while maintaining quality for production deployment.
Resources and References
- LoRA paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- DoRA paper: "DoRA: Weight-Decomposed Low-Rank Adaptation" (Liu et al., 2024)
- Adapters paper: "Parameter-Efficient Transfer Learning" (Houlsby et al., 2019)
- HuggingFace PEFT: https://github.com/huggingface/peft
- Unsloth: https://github.com/unslothai/unsloth (LoRA 2-5x faster)
- TRL (Transformer Reinforcement Learning): https://github.com/huggingface/trl