02 - Fine-Tuning Transformers: LoRA, QLoRA, DoRA
Pre-trained Transformer models such as Llama 3, Mistral, GPT-4 and Claude possess impressive knowledge of language and reasoning, yet they are rarely ready out of the box for a specific task. To adapt them to our domain, whether it is classifying corporate emails, generating code in a proprietary framework or answering medical questions, we need fine-tuning.
The problem is that these models have billions of parameters: LLaMA 2 has 7 billion in its smallest version, 70 billion in its largest. A full fine-tuning run on LLaMA-7B requires roughly 28 GB of VRAM just for the weights in FP32, plus twice as much (56 GB) for the Adam optimizer states (first and second moments), plus memory for gradients and activations. In practice you need 4x A100 80 GB GPUs for a single fine-tuning run.
In this second article of the Advanced Deep Learning and Edge Deployment series, we will explore Parameter-Efficient Fine-Tuning (PEFT) techniques that allow you to adapt models with billions of parameters using a fraction of the resources. We will start from the problem with full fine-tuning, then analyze LoRA, QLoRA, DoRA and Adapter Layers in detail, with complete Python implementations using HuggingFace PEFT.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | Attention Mechanism in Transformers | Self-attention, multi-head, full architecture |
| 2 | You are here - Fine-Tuning with LoRA, QLoRA and Adapters | Parameter-efficient fine-tuning |
| 3 | Model Quantization | INT8, INT4, GPTQ, AWQ |
| 4 | Pruning and Compression | Parameter reduction, distillation |
| 5 | Knowledge Distillation | Teacher-student, knowledge transfer |
| 6 | Ollama and Local LLMs | Local inference, optimization |
| 7 | Vision Transformers | ViT, DINO, image classification |
| 8 | Edge Deployment | ONNX, TensorRT, mobile devices |
| 9 | NAS and AutoML | Neural Architecture Search |
| 10 | Benchmarks and Optimization | Profiling, metrics, tuning |
What You Will Learn
- Why full fine-tuning is unsustainable for models with billions of parameters
- A complete overview of PEFT techniques: prefix tuning, prompt tuning, adapters, LoRA
- The mathematics of LoRA: low-rank decomposition, alpha scaling, target modules
- QLoRA: how to combine 4-bit quantization with LoRA for fine-tuning on consumer GPUs
- DoRA: the weight decomposition into magnitude and direction that improves on LoRA
- Adapter Layers: bottleneck adapters, parallel adapters, adapter fusion
- Complete practical implementation with HuggingFace PEFT and SFTTrainer
- Dataset preparation: Alpaca format, chat template, conversational format
- Hyperparameter tuning: rank, alpha, learning rate, batch size
- Comparison of LoRA vs QLoRA vs DoRA vs full fine-tuning with real benchmarks
- Hardware requirements and consumer GPU limitations
1. The Problem with Full Fine-Tuning
Full fine-tuning consists of updating all parameters of a pre-trained model during training on a specific dataset. Every weight, from embedding layers to attention heads, is modified through backpropagation. This approach worked well for moderately sized models like BERT (110M parameters), but becomes unsustainable when scaling to billions of parameters.
1.1 Computational Costs
To understand why full fine-tuning is prohibitive, let us analyze the memory requirements for LLaMA 2 7B:
Full Fine-Tuning Memory Footprint (LLaMA 2 7B)
| Component | Formula | Memory |
|---|---|---|
| Model weights (FP16) | 7B * 2 bytes | 14 GB |
| Gradients (FP16) | 7B * 2 bytes | 14 GB |
| Optimizer states (Adam FP32) | 7B * 4 bytes * 2 (m, v) | 56 GB |
| Activations (batch_size=1) | Variable | ~8-16 GB |
| Total | | ~92-100 GB |
No consumer GPU can handle 100 GB of VRAM. Even the A100 80 GB is insufficient without techniques like gradient checkpointing and mixed precision. For LLaMA 70B, we are talking about over 1 TB of required memory.
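The table's arithmetic can be reproduced with a back-of-the-envelope helper (illustrative, not from any library; activation memory varies with batch size and sequence length, so it is passed in as an estimate):

```python
def full_ft_memory_gb(n_params: float, act_gb: float = 8.0) -> dict:
    """Rough full fine-tuning memory estimate in GB (FP16 weights/grads, FP32 Adam)."""
    GB = 1e9
    weights = n_params * 2 / GB      # FP16: 2 bytes per parameter
    grads = n_params * 2 / GB        # FP16 gradients, same size as weights
    adam = n_params * 4 * 2 / GB     # FP32 first and second moments: 8 bytes/param
    total = weights + grads + adam + act_gb
    return {"weights": weights, "grads": grads, "adam": adam, "total": total}

est = full_ft_memory_gb(7e9)
print(est)  # weights 14.0, grads 14.0, adam 56.0, total 92.0 GB for a 7B model
```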
1.2 Catastrophic Forgetting
The second problem with full fine-tuning is catastrophic forgetting. When we update all parameters on a specific dataset, the model tends to forget the knowledge acquired during pre-training. A model fine-tuned on legal documents might lose its ability to generate Python code or answer history questions.
This happens because full fine-tuning indiscriminately modifies all weights, including those that encode general language knowledge. In practice, the model overwrites its previous knowledge with the new one.
1.3 Storage for Multiple Models
If an organization needs to adapt the same base model to 10 different tasks (email classification, sentiment analysis, medical Q&A, report generation, etc.), full fine-tuning requires saving 10 complete copies of the model: for LLaMA 7B, that is 140 GB of storage. With LoRA, as we will see, each adaptation requires only 10-50 MB, for a total of 100-500 MB.
Why Full Fine-Tuning Does Not Scale
| Problem | Full Fine-Tuning | PEFT (LoRA) |
|---|---|---|
| VRAM for LLaMA 7B | ~100 GB | ~16-24 GB |
| Trainable parameters | 7,000,000,000 | ~4,000,000 (0.06%) |
| Storage per adaptation | 14 GB | 10-50 MB |
| Catastrophic forgetting | High risk | Minimal risk |
| Hardware required | 4x A100 80GB | 1x RTX 3090 24GB |
2. Parameter-Efficient Fine-Tuning (PEFT): Overview
PEFT techniques share a fundamental principle: instead of updating all model parameters, we freeze the pre-trained weights and train only a small subset of additional parameters. This drastically reduces memory requirements, speeds up training and preserves the base model's knowledge.
PEFT Techniques Taxonomy
| Technique | Where It Acts | Extra Parameters | Advantages |
|---|---|---|---|
| Prompt Tuning | Input embeddings | Soft tokens prepended to input | Very simple, few parameters |
| Prefix Tuning | Every attention layer | Trainable K,V prefixes | More expressive than prompt tuning |
| Adapter Layers | Between Transformer layers | Inserted bottleneck modules | Flexible, composable |
| LoRA | Existing weight matrices | Low-rank matrices B and A | Zero inference overhead |
| QLoRA | Like LoRA + quantization | LoRA on 4-bit model | Fine-tuning on consumer GPUs |
| DoRA | Like LoRA + decomposition | Magnitude + direction | Higher quality than LoRA |
2.1 Prompt Tuning
Prompt tuning (Lester et al., 2021) adds a set of trainable soft tokens at the beginning of the input. These tokens do not correspond to real words in the vocabulary but are continuous vectors that the model learns during training. The base model remains completely frozen: only the soft token embeddings are trained.
Original input: [Classify this email: "Meeting at 3pm"]
With prompt tuning: [T1][T2][T3]...[T20] [Classify this email: "Meeting at 3pm"]
^^^^^^^^^^^^^^^^^
20 trainable soft tokens
(20 * d_model parameters = 20 * 4096 = 81,920 parameters)
The base model (7B parameters) is FROZEN.
Only 81,920 parameters are updated.
Prompt tuning works surprisingly well for very large models (> 10B parameters), but loses quality with smaller models. With T5-XXL (11B), prompt tuning achieves performance nearly identical to full fine-tuning on SuperGLUE.
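The mechanism above can be sketched in a few lines of NumPy (illustrative shapes only; the real soft prompt lives inside the model's embedding pipeline): the frozen model simply sees 20 extra "token" embeddings prepended to the real input.

```python
import numpy as np

d_model, n_soft, seq_len = 4096, 20, 12

# Trainable soft prompt: the ONLY parameters updated during training
soft_prompt = np.random.randn(n_soft, d_model).astype(np.float32) * 0.02

# Frozen input embeddings for a 12-token example
input_embeds = np.random.randn(seq_len, d_model).astype(np.float32)

# Prepend the soft tokens before feeding the (frozen) Transformer
model_input = np.concatenate([soft_prompt, input_embeds], axis=0)

print(model_input.shape)  # (32, 4096)
print(soft_prompt.size)   # 81920 trainable parameters, as computed above
```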
2.2 Prefix Tuning
Prefix tuning (Li & Liang, 2021) extends the prompt tuning idea: instead of adding soft tokens only to the input, it adds trainable prefixes to the keys (K) and values (V) of every attention layer. This gives the model more expressive power compared to simple prompt tuning.
Prompt Tuning:
Layer 1 Attention: Q=[input], K=[input], V=[input]
Layer 2 Attention: Q=[input], K=[input], V=[input]
Only soft tokens prepended to the initial input.
Prefix Tuning:
Layer 1 Attention: Q=[input], K=[prefix_1 | input], V=[prefix_1 | input]
Layer 2 Attention: Q=[input], K=[prefix_2 | input], V=[prefix_2 | input]
...
Layer L Attention: Q=[input], K=[prefix_L | input], V=[prefix_L | input]
Trainable prefixes at EVERY attention layer.
Parameters: L * prefix_len * 2 * d_model (2 for K and V)
Example: 32 layers * 20 prefix * 2 * 4096 = 5,242,880 parameters
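The parameter count follows directly from the formula (a small hypothetical helper, not a library function):

```python
def prefix_tuning_params(n_layers: int, prefix_len: int, d_model: int) -> int:
    """Trainable parameters: one K prefix and one V prefix per attention layer."""
    return n_layers * prefix_len * 2 * d_model

# LLaMA-7B-like config: 32 layers, prefix of 20 tokens, d_model = 4096
print(prefix_tuning_params(32, 20, 4096))  # 5242880
```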
3. LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2021) is the most widely used PEFT technique and represents a turning point in fine-tuning large language models. The key idea is elegant in its simplicity: instead of directly updating the Transformer weight matrices, we decompose the update into two low-rank matrices.
3.1 The Intuition: Intrinsic Rank Hypothesis
Research (Aghajanyan et al., 2020) has shown that when we fine-tune on a specific task, the weight updates concentrate in a low-dimensional subspace. In other words, the weight update matrix has an intrinsic rank much lower than its nominal dimension.
Consider a weight matrix W of dimensions d x k (for example, the query projection in the attention of LLaMA 7B: 4096 x 4096). The full update would require a matrix of 4096 x 4096 = 16,777,216 parameters. But if the update has a low intrinsic rank, we can approximate it with a much smaller rank r.
3.2 The Mathematics of LoRA
Given a pre-trained weight matrix W_0 of dimensions d x k, LoRA parametrizes the update as the product of two low-rank matrices:
Fundamental LoRA Formula
W = W_0 + \Delta W = W_0 + B \cdot A
Where:
- W_0 \in \mathbb{R}^{d \times k} is the pre-trained weight matrix (frozen)
- B \in \mathbb{R}^{d \times r} is the up-projection matrix (maps from the rank-r subspace back to dimension d)
- A \in \mathbb{R}^{r \times k} is the down-projection matrix (maps the input into the rank-r subspace)
- r \ll \min(d, k) is the rank (typically 4, 8, 16, 32, 64)
With scaling: h = W_0 x + \frac{\alpha}{r} \cdot B A x
where \alpha is a scaling factor that controls the intensity of the adaptation.
Original matrix W_0 (frozen):
k = 4096
+------------------+
| |
d = | W_0 | 16,777,216 parameters
4096| (FROZEN) | Not updated
| |
+------------------+
LoRA update (trainable):
r = 16 k = 4096
+-----+ +------------------+
| | | |
d = | B | x r=16 | A |
4096| | | |
| | +------------------+
+-----+
65,536 65,536
B (d x r) A (r x k)
= 4096 * 16 = 16 * 4096
= 65,536 = 65,536
Total LoRA: 131,072 parameters (0.78% of 16,777,216)
3.3 Initialization
The initialization of A and B is crucial. LoRA initializes:
- A with Kaiming uniform initialization (a small random Gaussian in the original paper)
- B with all zeros
This guarantees that at the start of training the LoRA contribution is exactly zero (B * A = 0), so the model starts from the behavior of the pre-trained model. Training gradually modifies B and A to adapt the model to the specific task.
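Sections 3.2 and 3.3 can be condensed into a small NumPy sketch (illustrative, not the PEFT implementation; toy dimensions): with B initialized to zero, the adapted layer is exactly the frozen layer at step 0, and the update B A can never exceed rank r.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16          # toy dimensions; real LLaMA uses d = k = 4096

W0 = rng.standard_normal((d, k))        # pre-trained weight, frozen
A = rng.standard_normal((r, k)) * 0.01  # down-projection, random init
B = np.zeros((d, r))                    # up-projection, zero init

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
assert np.allclose(lora_forward(x), W0 @ x)  # B = 0  =>  identical to the base model

# Simulate training having moved B away from zero: the update B @ A has rank <= r
B = rng.standard_normal((d, r)) * 0.01
delta_W = B @ A
print(np.linalg.matrix_rank(delta_W))  # at most 8
```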
3.4 Scaling Factor Alpha
The parameter \alpha (alpha) controls the intensity of the LoRA adaptation. The effective output is:
h = W_0 x + \frac{\alpha}{r} \cdot B A x
The ratio \alpha / r works as a learning rate for LoRA. In practice, \alpha = 2r is often used (for example \alpha = 32 with r = 16), yielding a scaling factor of 2. This allows changing the rank without having to readjust the learning rate.
3.5 Target Modules
LoRA is not applied to all Transformer layers, but only to specific weight matrices. The choice of target modules significantly influences fine-tuning quality:
Target Modules for Common Architectures
| Module | Type | Dimensions (LLaMA 7B) | Impact |
|---|---|---|---|
| q_proj | Attention Query | 4096 x 4096 | High - controls what the model "looks for" |
| k_proj | Attention Key | 4096 x 4096 | High - controls how tokens are indexed |
| v_proj | Attention Value | 4096 x 4096 | High - controls extracted information |
| o_proj | Attention Output | 4096 x 4096 | Medium |
| gate_proj | FFN Gate | 4096 x 11008 | High - controls information flow in FFN |
| up_proj | FFN Up | 4096 x 11008 | Medium-High |
| down_proj | FFN Down | 11008 x 4096 | Medium |
3.6 Exact Calculation of Trainable Parameters
Let us calculate the trainable parameters for LLaMA 2 7B with LoRA applied to q_proj and v_proj across all 32 layers:
LLaMA 2 7B: 32 layers, d_model = 4096
With r = 16, target_modules = ["q_proj", "v_proj"]:
Per layer:
q_proj LoRA: B (4096 x 16) + A (16 x 4096) = 65,536 + 65,536 = 131,072
v_proj LoRA: B (4096 x 16) + A (16 x 4096) = 65,536 + 65,536 = 131,072
Total per layer: 262,144
Total: 32 layers * 262,144 = 8,388,608 parameters (8.4M)
Ratio: 8.4M / 6,738M (total LLaMA 7B) = 0.12% of parameters
With target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]:
Per layer: 4 attention modules * 131,072 + 3 FFN modules * 241,664 (= r * (4096 + 11008)) = 1,249,280
Total: 32 * 1,249,280 = ~40.0M parameters = 0.59%
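The q_proj/v_proj count above can be reproduced with a small helper (hypothetical, for illustration): each adapted matrix of shape d_in x d_out costs r * (d_in + d_out) LoRA parameters.

```python
def lora_params(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """LoRA trainable parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

# LLaMA 2 7B, r = 16, target_modules = ["q_proj", "v_proj"], 32 layers
print(lora_params(16, [(4096, 4096), (4096, 4096)], 32))  # 8388608
```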
3.7 Inference Advantage: Zero Overhead
The key advantage of LoRA over Adapter Layers is that at inference time there is no computational overhead. The LoRA matrices can be merged into the base model weights:
W_{merged} = W_0 + \frac{\alpha}{r} \cdot B \cdot A
After merging, the model has exactly the same size and inference speed as the original model, but with updated weights. This is impossible with adapter layers, which add permanent parameters to the model.
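The zero-overhead claim is easy to verify numerically (NumPy sketch with toy dimensions): after merging, a single matmul reproduces the base-plus-adapter output exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 8

W0 = rng.standard_normal((d, k))
B = rng.standard_normal((d, r)) * 0.1
A = rng.standard_normal((r, k)) * 0.1
x = rng.standard_normal(k)

# Inference with a separate adapter: two extra (cheap) matmuls per layer
h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))

# Inference after merging: one matmul, same shape as the original weight
W_merged = W0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(np.allclose(h_adapter, h_merged))  # True
```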
4. QLoRA: Quantized LoRA
QLoRA (Dettmers et al., 2023) is an extension of LoRA that combines 4-bit quantization of the base model with LoRA fine-tuning. This makes it possible to fine-tune models with 65 billion parameters on a single 48 GB GPU, a result that was previously impossible.
4.1 The Three Innovations of QLoRA
QLoRA introduces three key techniques:
QLoRA Innovations
- 4-bit NormalFloat (NF4): A new quantized data type optimized for normally distributed weights. Each weight is mapped to one of 16 values (4 bits) chosen to minimize quantization error on a Gaussian distribution. NF4 outperforms INT4 and FP4 because neural network weights approximately follow a normal distribution.
- Double Quantization: The quantization constants (one per block of 64 weights) are themselves quantized to 8-bit. This saves an additional 0.37 bits per parameter, roughly 3 GB for a 65B model.
- Paged Optimizers: Uses NVIDIA GPU paged memory (unified memory) to handle memory spikes during training, automatically swapping optimizer states between GPU and CPU when needed.
4.2 How NF4 Works
NF4 quantization is based on the observation that weights of well-trained neural networks follow an approximately normal distribution with zero mean. NF4 divides the standard normal distribution into 16 equal-probability intervals and assigns each interval the optimal value that minimizes the expected error:
INT4 (evenly spaced):
Levels: [-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
Problem: many levels in the tails (few weights there), few at center (many weights there)
NF4 (normal quantiles):
Levels: [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
Advantage: dense levels where most weights are (near 0)
Result: NF4 produces ~2x lower quantization error than INT4
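A toy round-to-nearest quantizer over the two grids makes the difference concrete (the NF4 levels are the published ones; as a simplifying assumption, weights are taken as already scaled to [-1, 1] by their block absmax, and INT4 is mapped to an evenly spaced grid on that range):

```python
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
INT4_LEVELS = [i / 8 for i in range(-8, 8)]  # evenly spaced: -1.0, -0.875, ..., 0.875

def quantize(w: float, levels: list[float]) -> float:
    """Round-to-nearest over a fixed 16-value codebook (= 4 bits per weight)."""
    return min(levels, key=lambda q: abs(q - w))

# Most weights sit near zero, exactly where NF4 packs its levels densely
w = 0.05
print(quantize(w, NF4_LEVELS))   # 0.0796 (error 0.0296)
print(quantize(w, INT4_LEVELS))  # 0.0    (error 0.05)
```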
4.3 QLoRA Memory Footprint
Memory Comparison: Full FT vs LoRA vs QLoRA (LLaMA 7B)
| Component | Full FT (FP16) | LoRA (FP16) | QLoRA (NF4) |
|---|---|---|---|
| Model weights | 14 GB (FP16) | 14 GB (FP16) | 3.5 GB (NF4) |
| Gradients | 14 GB | ~16 MB | ~16 MB |
| Optimizer states | 56 GB | ~64 MB | ~64 MB |
| Activations | ~12 GB | ~12 GB | ~6 GB |
| Total | ~96 GB | ~26 GB | ~10 GB |
With QLoRA, fine-tuning LLaMA 7B becomes possible on a single RTX 3090 (24 GB), and LLaMA 13B on a single RTX 4090 (24 GB) with gradient checkpointing enabled.
5. DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA (Liu et al., 2024) is an evolution of LoRA that decomposes the weight matrix into two components: magnitude and direction. This decomposition is inspired by the classical Weight Normalization from Salimans & Kingma (2016) and aims to bridge the quality gap between LoRA and full fine-tuning.
5.1 The Intuition Behind DoRA
Analysis of learning patterns shows a fundamental difference between full fine-tuning and LoRA: full fine-tuning modifies both the magnitude and direction of weights independently, while LoRA tends to modify them proportionally, limiting its expressiveness.
5.2 The Mathematics of DoRA
DoRA decomposes the pre-trained weight matrix into:
DoRA Formula
W' = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}
Where:
- m \in \mathbb{R}^{1 \times k} is the magnitude vector (trainable), initialized with the column norms of W_0
- W_0 + BA represents the direction updated via LoRA
- \|\cdot\|_c is the column-wise norm, normalizing each column to unit norm
- B and A are the standard LoRA matrices (trainable)
In this way, DoRA has two independent degrees of freedom:
- The direction is controlled by B and A (exactly as in LoRA)
- The magnitude is controlled by the vector m, which can change independently from the direction
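The decomposition can be sketched in NumPy (illustrative, toy dimensions): with B = 0 and m initialized to the column norms of W_0, the DoRA-adapted weight equals W_0 exactly at step 0, and the magnitude can then move independently of the direction.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 32, 32, 4

W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))                           # LoRA-style zero init
m = np.linalg.norm(W0, axis=0, keepdims=True)  # magnitude init: column norms of W0

def dora_weight(W0, m, B, A):
    # W' = m * (W0 + B A) / ||W0 + B A||_c  (column-wise norm)
    V = W0 + B @ A
    return m * (V / np.linalg.norm(V, axis=0, keepdims=True))

assert np.allclose(dora_weight(W0, m, B, A), W0)  # identity at initialization

# Scaling m changes magnitude without touching direction (impossible in plain LoRA)
W_new = dora_weight(W0, m * 1.5, B, A)
print(np.allclose(W_new, 1.5 * W0))  # True
```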
5.3 DoRA Advantages Over LoRA
DoRA vs LoRA Benchmark (Commonsense Reasoning, LLaMA-7B)
| Task | LoRA (r=32) | DoRA (r=32) | Full FT |
|---|---|---|---|
| BoolQ | 69.8 | 71.8 | 73.2 |
| PIQA | 82.1 | 83.2 | 83.9 |
| WinoGrande | 79.4 | 80.6 | 81.5 |
| HellaSwag | 83.6 | 84.4 | 85.1 |
| Average | 78.7 | 80.0 | 80.9 |
DoRA closes the gap between LoRA and full fine-tuning by approximately 60-70%, with minimal parameter overhead (only the additional m vector).
6. Adapter Layers
Adapter Layers (Houlsby et al., 2019) were among the first PEFT techniques proposed. The idea is simple: insert small trainable modules (adapters) between the existing Transformer layers, while keeping the original weights frozen.
6.1 Bottleneck Adapters
A classic adapter is a bottleneck module built from a down-projection, a non-linearity and an up-projection, wrapped in a residual connection:
- Down-projection: reduces dimension from d_model to d_bottleneck (e.g., 4096 -> 64)
- Non-linearity: typically ReLU or GELU
- Up-projection: restores the original dimension (e.g., 64 -> 4096)
- Residual connection: the adapter output is added to the input
Input x (dimension d_model = 4096)
|
+----> Down-proj: W_down (4096 x 64) --> h (dimension 64)
| |
| Non-linearity (GELU)
| |
| Up-proj: W_up (64 x 4096) <------+
| |
+------- (+) ----+ Residual Connection
|
Output (dimension 4096)
Parameters per adapter: (4096 * 64) + (64 * 4096) = 524,288
With 2 adapters per layer, 32 layers: 2 * 32 * 524,288 = 33.5M parameters
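The bottleneck above in a few lines of NumPy (illustrative sketch: GELU replaced by ReLU for brevity, biases omitted, and the up-projection zero-initialized so the adapter starts as an exact identity):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_bottleneck = 4096, 64

W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.01  # 4096 -> 64
W_up = np.zeros((d_bottleneck, d_model))                      # 64 -> 4096, zero init

def adapter(x):
    h = np.maximum(x @ W_down, 0.0)  # down-projection + ReLU non-linearity
    return x + h @ W_up              # up-projection + residual connection

x = rng.standard_normal((1, d_model))
out = adapter(x)
print(out.shape)                 # (1, 4096)
print(W_down.size + W_up.size)   # 524288 parameters, as computed above
```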
6.2 Comparison: Adapters vs LoRA
Adapters vs LoRA: Key Differences
| Characteristic | Adapter Layers | LoRA |
|---|---|---|
| Inference overhead | Yes (additional layers) | No (merged into weights) |
| Additional latency | ~5-10% increase | 0% |
| Composability | High (Adapter Fusion) | Medium (LoRA addition) |
| Ease of implementation | High | High |
| Library support | AdapterHub, PEFT | PEFT, unsloth, axolotl |
7. Practical Implementation with HuggingFace PEFT
Let us move to practice. In this section we will implement fine-tuning of a Transformer model using LoRA and QLoRA with the HuggingFace PEFT library. We will use Mistral 7B as the base model and adapt it for an instruction-response task.
7.1 Setup and Installation
# Install required libraries
# pip install torch transformers peft trl datasets accelerate bitsandbytes
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import (
LoraConfig,
get_peft_model,
prepare_model_for_kbit_training,
TaskType,
)
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Check available GPU
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
7.2 LoRA Configuration
# LoRA configuration for Mistral 7B
lora_config = LoraConfig(
# Rank of the low-rank decomposition
# Typical values: 4 (minimum), 8 (good), 16 (great), 32 (high), 64 (maximum)
r=16,
# Alpha: scaling factor. The effective weight is alpha/r.
# Rule of thumb: alpha = 2 * r
lora_alpha=32,
# Dropout applied to LoRA layers during training
# Helps prevent overfitting, especially with small datasets
lora_dropout=0.05,
# Transformer modules to apply LoRA to
# For Mistral/LLaMA: q_proj, k_proj, v_proj, o_proj (attention)
# gate_proj, up_proj, down_proj (FFN)
target_modules=[
"q_proj", # Query projection - high impact
"k_proj", # Key projection - high impact
"v_proj", # Value projection - high impact
"o_proj", # Output projection - medium impact
"gate_proj", # FFN gate - high impact
"up_proj", # FFN up projection - medium-high impact
"down_proj", # FFN down projection - medium impact
],
# Task type
task_type=TaskType.CAUSAL_LM,
# Bias: "none" (recommended), "all", "lora_only"
bias="none",
)
# Print configuration summary
print(f"Rank: {lora_config.r}")
print(f"Alpha: {lora_config.lora_alpha}")
print(f"Scaling: {lora_config.lora_alpha / lora_config.r}")
print(f"Target modules: {lora_config.target_modules}")
print(f"Dropout: {lora_config.lora_dropout}")
7.3 Fine-Tuning with LoRA (FP16)
model_name = "mistralai/Mistral-7B-v0.3"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Load model in FP16
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto", # Automatic distribution across available GPUs
attn_implementation="flash_attention_2", # Flash Attention for efficiency
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Show trainable parameters
model.print_trainable_parameters()
# Indicative output (Mistral 7B, r=16, 7 target modules):
# trainable params: 41,943,040 || all params: ~7.29B || trainable%: ~0.58%
# Training configuration
training_args = SFTConfig(
output_dir="./results/mistral-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size: 4 * 4 = 16
gradient_checkpointing=True, # Save VRAM by trading compute time
optim="adamw_torch",
learning_rate=2e-4, # Higher LR than full FT
lr_scheduler_type="cosine",
warmup_ratio=0.03, # 3% of steps as warmup
weight_decay=0.001,
max_grad_norm=0.3,
logging_steps=10,
save_strategy="steps",
save_steps=100,
max_seq_length=2048,
fp16=True,
report_to="wandb", # Logging to Weights & Biases
seed=42,
)
7.4 Fine-Tuning with QLoRA (4-bit)
# 4-bit quantization configuration (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Load weights in 4-bit
bnb_4bit_quant_type="nf4", # NormalFloat 4-bit (better than INT4)
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 (more stable than FP16)
bnb_4bit_use_double_quant=True, # Double quantization (saves ~0.4 bits/param)
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",
)
# Prepare model for k-bit training
# This freezes the quantized weights and prepares layers for LoRA
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
)
# Apply LoRA (same configuration as before)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params unchanged vs FP16 LoRA (~42M); the reported 'all params'
# roughly halves because bitsandbytes packs two 4-bit weights per stored parameter
# Note: the base model now takes ~3.8GB instead of ~14GB
# QLoRA training configuration
training_args = SFTConfig(
output_dir="./results/mistral-qlora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
optim="paged_adamw_8bit", # Paged optimizer to handle memory spikes
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
weight_decay=0.001,
max_grad_norm=0.3,
logging_steps=10,
save_strategy="steps",
save_steps=100,
max_seq_length=2048,
bf16=True, # BF16 for compute (more stable)
report_to="wandb",
seed=42,
)
8. Dataset Preparation
Fine-tuning quality depends enormously on dataset quality. In this section we will see how to prepare data in the standard formats used for LLM fine-tuning.
8.1 Alpaca Format
The Alpaca format (Stanford, 2023) is one of the most popular formats for fine-tuning LLMs on instruction-response tasks. Each example has three fields:
def format_alpaca(example):
"""Convert an example to Alpaca format for fine-tuning."""
if example.get("input", ""):
# With additional input
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
# Instruction and response only
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return {"text": text}
# Load dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
dataset = dataset.map(format_alpaca)
# Example formatted output
print(dataset[0]["text"])
# ### Instruction:
# Give three tips for staying healthy.
#
# ### Response:
# 1. Eat a balanced and nutritious diet...
# 2. Exercise regularly...
# 3. Get enough sleep...
8.2 Chat Template (Mistral/ChatML)
For conversational models, the chat template format is more appropriate. Each model has its own specific template:
def format_chat_template(example, tokenizer):
"""Format the example using the model's chat template."""
messages = [
{"role": "system", "content": "You are a helpful expert assistant."},
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]},
]
# Apply the tokenizer's chat template
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
return {"text": text}
# Resulting format for Mistral (its template folds the system message into the first user turn):
# <s>[INST] You are a helpful expert assistant.
#
# Give three tips for staying healthy. [/INST]
# 1. Eat a balanced diet...</s>
# For ChatML (used by many models):
# <|im_start|>system
# You are a helpful expert assistant.<|im_end|>
# <|im_start|>user
# Give three tips for staying healthy.<|im_end|>
# <|im_start|>assistant
# 1. Eat a balanced diet...<|im_end|>
8.3 Training with SFTTrainer
from trl import SFTTrainer
# Formatted dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
dataset = dataset.map(
lambda x: format_chat_template(x, tokenizer)
)
# Train/eval split
dataset = dataset.train_test_split(test_size=0.05, seed=42)
# Create trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer,
)
# Start training
trainer.train()
# Save LoRA adapter (only ~50MB)
trainer.save_model("./results/mistral-qlora/final")
tokenizer.save_pretrained("./results/mistral-qlora/final")
9. Weight Merging and Deployment
After training, we have a frozen base model and a small LoRA adapter. For deployment, we can keep them separate (useful for switching between different adaptations) or merge them into a single model.
9.1 Merging LoRA Weights into the Base Model
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model (FP16 for merging)
base_model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.3",
torch_dtype=torch.float16,
device_map="auto",
)
# Load LoRA adapter
model = PeftModel.from_pretrained(
base_model,
"./results/mistral-qlora/final",
)
# Merge LoRA into base weights
# After merge: W_merged = W_0 + (alpha/r) * B * A
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./models/mistral-7b-finetuned")
tokenizer.save_pretrained("./models/mistral-7b-finetuned")
# Upload to HuggingFace Hub
model.push_to_hub("username/mistral-7b-finetuned")
tokenizer.push_to_hub("username/mistral-7b-finetuned")
print("Model merged and uploaded to HuggingFace Hub!")
9.2 Inference with the Fine-Tuned Model
from transformers import pipeline
# Generation pipeline
pipe = pipeline(
"text-generation",
model="./models/mistral-7b-finetuned",
torch_dtype=torch.float16,
device_map="auto",
)
# Generate
messages = [
{"role": "system", "content": "You are an expert programming assistant."},
{"role": "user", "content": "Explain the Repository pattern in Python."},
]
output = pipe(
messages,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
print(output[0]["generated_text"][-1]["content"])
10. Hyperparameter Tuning
Choosing the right hyperparameters is crucial for good results. Here is a practical guide based on community experience and published benchmarks.
LoRA Hyperparameter Guide
| Hyperparameter | Range | Recommended | Notes |
|---|---|---|---|
| rank (r) | 4-256 | 16-64 | Increase for complex tasks; r=8 often sufficient for classification |
| alpha | 8-128 | 2 * rank | alpha/r is the effective scaling; keep ratio constant when changing r |
| learning_rate | 1e-5 - 5e-4 | 2e-4 | Higher than full FT (10-100x); reduce if loss oscillates |
| batch_size | 1-32 | 4-8 | Use gradient_accumulation to simulate larger batches |
| epochs | 1-5 | 2-3 | Watch for overfitting; monitor eval loss |
| warmup_ratio | 0.01-0.1 | 0.03 | Important for stability; higher with high LR |
| dropout | 0.0-0.1 | 0.05 | 0.0 for large datasets, 0.1 for small datasets |
| max_seq_length | 512-8192 | 2048 | Larger = more VRAM; adapt to dataset |
Common Fine-Tuning Mistakes
- Learning rate too high: loss oscillates or diverges. Solution: reduce LR by 2-5x or increase warmup
- Rank too low: the model does not learn enough. Solution: increase r from 8 to 16 or 32
- Rank too high: overfitting, especially with small datasets. Solution: reduce r or increase dropout
- Too few epochs: underfitting. Check whether eval loss is still decreasing
- Too many epochs: training loss drops but eval loss rises (overfitting)
- Unclean dataset: duplicates, errors, inconsistent formatting degrade quality
- Forgetting gradient checkpointing: immediate OOM on large models
11. Benchmarks and Comparisons
How do you choose between LoRA, QLoRA, DoRA and full fine-tuning? Here is a systematic comparison based on published benchmarks and community testing.
Complete Comparison: LoRA vs QLoRA vs DoRA vs Full FT
| Metric | Full FT | LoRA | QLoRA | DoRA |
|---|---|---|---|---|
| Quality (MT-Bench) | 7.8 | 7.5 | 7.3 | 7.6 |
| VRAM (7B model) | ~100 GB | ~26 GB | ~10 GB | ~26 GB |
| Training speed (rel.) | 1.0x | 1.2x | 0.8x | 1.1x |
| Trainable parameters | 100% | 0.1-0.5% | 0.1-0.5% | 0.1-0.5% + m |
| Adaptation storage | 14 GB | 10-50 MB | 10-50 MB | 10-50 MB |
| Inference overhead | 0% | 0% (merged) | 0% (merged) | 0% (merged) |
| Minimum GPU (7B) | 4x A100 | 1x A100 40GB | 1x RTX 3090 | 1x A100 40GB |
12. Hardware Requirements
Hardware choice depends on the model you want to fine-tune and the PEFT technique you use. Here is a practical guide for consumer and professional GPUs.
GPU Requirements for Fine-Tuning
| GPU | VRAM | LoRA (FP16) | QLoRA (4-bit) | Notes |
|---|---|---|---|---|
| RTX 3060 | 12 GB | up to 3B | up to 7B (seq 512) | Entry-level, limited |
| RTX 3090 | 24 GB | up to 7B | up to 13B | Great for QLoRA 7B |
| RTX 4090 | 24 GB | up to 7B | up to 13B | Faster than 3090, same VRAM |
| A100 40GB | 40 GB | up to 13B | up to 34B | Professional standard |
| A100 80GB | 80 GB | up to 30B | up to 70B | Ideal for large models |
| H100 80GB | 80 GB | up to 30B | up to 70B | Faster than A100, FP8 support |
Tips for Consumer GPUs
- Limited budget (RTX 3060 12GB): QLoRA on 7B models with seq_length=512, batch_size=1, gradient_accumulation=16
- Good value (RTX 3090/4090 24GB): QLoRA on 7-13B models with seq_length=2048, batch_size=4
- Cloud computing: Google Colab Pro ($10/month) offers A100 40GB for limited sessions; RunPod and Lambda Labs for intensive use
- Always enable: gradient_checkpointing=True, paged optimizers, Flash Attention 2
13. Practical Use Cases
Let us look at some concrete use cases where fine-tuning with LoRA is particularly effective.
13.1 Text Classification
For classification tasks (sentiment analysis, topic classification, spam detection), LoRA is often preferable because:
- The task requires few additional parameters (rank r=4-8 is sufficient)
- Classification datasets are typically small (1k-50k examples)
- The risk of overfitting with high rank is elevated
13.2 Code Generation
For fine-tuning on code generation (e.g., adapting a model to a specific language or corporate conventions), the recommendation is:
- Higher rank (r=32-64) because code has more structure and variability
- Complete target modules (all 7 Transformer modules)
- High-quality dataset: well-documented code, with tests, consistent style
- Long sequences (max_seq_length=4096-8192) for complete context
13.3 Domain-Specific Chatbot
To create a specialized chatbot (medical, legal, technical support):
- Conversational dataset in the model's chat template format
- Include refusal examples ("I cannot answer this question")
- Medium rank (r=16-32)
- Validation with domain experts, not just automatic metrics
13.4 Summarization
For fine-tuning on summarization tasks:
- Dataset with high-quality (document, summary) pairs
- Long sequences for input (up to 8192 tokens)
- Medium-high rank (r=16-32)
- Evaluation with metrics like ROUGE and human evaluation
14. Conclusions and Decision Tree
PEFT techniques have made fine-tuning of large language models accessible to anyone with a consumer GPU. LoRA, QLoRA and DoRA represent the state of the art for parameter-efficient fine-tuning, each with its own strengths.
Decision Tree: When to Use What
| Situation | Recommended Technique | Reason |
|---|---|---|
| GPU with < 16 GB VRAM | QLoRA | Only option for 7B+ models on consumer hardware |
| GPU with 24-40 GB VRAM | LoRA (FP16) | Better quality than QLoRA, faster training |
| Maximum quality required | DoRA | Closest to full FT, minimal overhead on LoRA |
| Multiple adaptations of the same model | LoRA | Small adapters (~50MB), fast switching between tasks |
| Small model (< 1B parameters) | Full Fine-Tuning | For small models, full FT is often feasible and better |
| Simple task (classification) | LoRA (r=4-8) | Low rank is sufficient; avoids overfitting |
| Complex task (code generation) | LoRA/DoRA (r=32-64) | High rank to capture task complexity |
| Skill composition | Adapter Fusion | Combines multiple adaptations in a structured way |
The field of efficient fine-tuning is evolving rapidly. New techniques such as GaLore (Gradient Low-Rank Projection) and ReLoRA (periodically merging and reinitializing low-rank updates to approximate high-rank training) promise to further narrow the gap with full fine-tuning. However, LoRA and QLoRA remain the de facto standard for LLM fine-tuning today, with a mature ecosystem of libraries (HuggingFace PEFT, Unsloth, Axolotl) and an active community.
In the next article of the series we will explore Model Quantization: GPTQ, AWQ, INT8, and how to reduce model size by 75% while maintaining quality for production deployment.
Resources and References
- LoRA paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- DoRA paper: "DoRA: Weight-Decomposed Low-Rank Adaptation" (Liu et al., 2024)
- Adapters paper: "Parameter-Efficient Transfer Learning" (Houlsby et al., 2019)
- HuggingFace PEFT: https://github.com/huggingface/peft
- Unsloth: https://github.com/unslothai/unsloth (LoRA 2-5x faster)
- TRL (Transformer Reinforcement Learning): https://github.com/huggingface/trl