Model Quantization: INT8, INT4, GPTQ, AWQ and Beyond
Frontier models occupy hundreds of gigabytes at full precision. Llama-3 70B in FP16 requires 140 GB of VRAM, far beyond any consumer hardware without quantization. With INT4 quantization, the same Llama-3 70B drops to 35 GB, fitting in two RTX 4090s or a system with 64 GB of RAM. The accuracy loss? Often less than 1%.
Model quantization has evolved from a memory-saving trick into a foundational technique for the modern LLM ecosystem. Algorithms like GPTQ, AWQ, SmoothQuant and the GGUF format from llama.cpp have democratized access to large language models, making them deployable on consumer hardware, edge devices, and even Raspberry Pis.
This guide covers quantization end-to-end: from the underlying math to choosing the right method for your specific use case, with working code examples for each technique.
What You Will Learn
- Why quantization is essential for modern AI deployment
- PTQ vs QAT: when each approach makes sense
- INT8 quantization with bitsandbytes and SmoothQuant
- INT4 quantization with NF4, FP4, and QLoRA
- How the GPTQ algorithm works internally
- AWQ: activation-aware weight quantization advantages
- GGUF format and llama.cpp quantization levels
- Accuracy vs speed vs memory benchmarks
- Quantization-Aware Training with PyTorch
- Best practices and deployment decision guide
The VRAM Problem: Why Quantization Matters
A single parameter in FP32 takes 4 bytes. In FP16/BF16 it takes 2 bytes. With INT8 quantization, 1 byte; with INT4, only 0.5 bytes. The table below shows memory requirements for Llama-3 70B across precisions:
| Precision | Bytes/param | Memory (70B) | Hardware needed |
|---|---|---|---|
| FP32 | 4 bytes | 280 GB | Impossible on consumer |
| BF16 / FP16 | 2 bytes | 140 GB | 2x A100 80GB |
| INT8 | 1 byte | 70 GB | 1x A100 80GB |
| INT4 / NF4 | 0.5 bytes | 35 GB | 2x RTX 4090 (24GB) |
| INT3 | 0.375 bytes | ~26 GB | RTX 3090 + RAM offload |
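The memory column is just parameter count times bytes per parameter (decimal GB, matching the table; a running model additionally needs room for activations and the KV cache, ignored here):

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight memory only: parameters * bytes/param, in decimal GB."""
    return n_params * bytes_per_param / 1e9

n = 70e9  # Llama-3 70B
for name, bpp in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {model_memory_gb(n, bpp):.0f} GB")
# FP32: 280 GB, BF16/FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```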
Beyond VRAM, quantization delivers benefits in throughput (tokens/sec), inference latency, and cloud costs. On Apple M-series chips, Raspberry Pi, or NVIDIA Jetson, quantization is the only way to run models beyond 1B parameters.
Market Data 2026
Gartner forecasts that by 2027, quantized Small Language Models (SLMs) will surpass cloud LLMs in deployment frequency by a factor of 3x. On-device AI reduces operational costs by 70% compared to cloud APIs, eliminating network latency and per-token billing. Quantization has become mission-critical for edge AI, mobile deployment, and embedded systems.
Quantization Math: From Float to Integer
Quantization maps a continuous floating-point value to a discrete integer. The process involves two operations: quantization (float to int) and dequantization (int back to float for computation).
# Uniform INT8 quantization (fundamental scheme)
# W_quantized = round(W / scale) + zero_point
# W_dequantized = (W_quantized - zero_point) * scale
import torch
def quantize_tensor_int8(tensor: torch.Tensor, symmetric: bool = True):
    """
    Uniform INT8 quantization.
    symmetric=True: zero_point=0, range [-127, 127], stored as int8
    symmetric=False: asymmetric range [0, 255] with zero_point offset, stored as uint8
    """
    if symmetric:
        max_val = tensor.abs().max().item()
        scale = max(max_val, 1e-8) / 127.0
        zero_point = 0
        qmin, qmax, dtype = -127, 127, torch.int8
    else:
        min_val = tensor.min().item()
        max_val = tensor.max().item()
        scale = max(max_val - min_val, 1e-8) / 255.0
        zero_point = round(-min_val / scale)
        zero_point = max(0, min(255, zero_point))
        qmin, qmax, dtype = 0, 255, torch.uint8
    # Quantize: FP16 -> integer grid (clamp to the range the storage dtype allows)
    quantized = torch.clamp(
        torch.round(tensor / scale) + zero_point,
        qmin, qmax
    ).to(dtype)
    # Dequantize: integer -> FP32 (for error measurement)
    dequantized = (quantized.float() - zero_point) * scale
    error = (tensor.float() - dequantized).abs().mean().item()
    return quantized, scale, zero_point, error
# Practical example
W = torch.randn(1024, 1024, dtype=torch.float16)
q, scale, zp, err = quantize_tensor_int8(W, symmetric=True)
print(f"Original: {W.dtype}, {W.element_size() * W.numel() / 1024:.1f} KB")
print(f"Quantized: {q.dtype}, {q.element_size() * q.numel() / 1024:.1f} KB")
print(f"Memory reduction: {(1 - q.element_size()/W.element_size()) * 100:.0f}%")
print(f"Mean absolute error: {err:.6f}")
# Typical output:
# Original: torch.float16, 2048.0 KB
# Quantized: torch.int8, 1024.0 KB
# Memory reduction: 50%
# Mean absolute error: ~0.0096 (roughly scale/4 for Gaussian weights)
PTQ vs QAT: Choosing the Right Paradigm
Two fundamental paradigms exist for model quantization:
- PTQ (Post-Training Quantization): applied after training to an already trained model. Requires only a small calibration dataset. Fast and practical, but may degrade accuracy on small models or very low precision (INT2, INT3).
- QAT (Quantization-Aware Training): simulates quantization during training, allowing the model to adapt its weights to precision loss. Produces better results but requires compute resources comparable to full fine-tuning.
PTQ vs QAT: Practical Comparison
| Aspect | PTQ | QAT |
|---|---|---|
| Time required | Minutes to hours | Hours to days |
| Calibration data | Small (512-2048 samples) | Full training dataset |
| VRAM for quantization | Low (forward pass only) | High (full backward pass) |
| INT8 accuracy | Excellent (<0.5% loss) | Excellent |
| INT4 accuracy | Good (1-3% loss) | Very good (<1% loss) |
| Best for | Large models, fast production | Small models, max quality |
For modern LLMs (7B+ parameters), PTQ is generally sufficient: parameter redundancy means INT4 quantization preserves most capabilities. For models under 3B parameters, QAT is recommended when accuracy is critical. Recent PyTorch experiments show QAT can recover up to 96% of accuracy degradation on HellaSwag compared to PTQ for Llama-3.
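The mechanism QAT relies on fits in a few lines: a fake-quantize operation rounds values in the forward pass but lets gradients flow through unchanged (the straight-through estimator), so training sees quantization error while optimization proceeds as usual. A minimal sketch (illustrative helper, not PyTorch's FakeQuantize module, which adds observers and learned scales):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Round to an INT8 grid in the forward pass; straight-through gradient.
    x + (q - x).detach() evaluates to q, but its derivative w.r.t. x is 1."""
    q = torch.clamp(torch.round(x / scale), -128, 127) * scale
    return x + (q - x).detach()

w = torch.randn(4, requires_grad=True)
loss = fake_quantize(w, scale=0.1).sum()
loss.backward()
print(w.grad)  # tensor([1., 1., 1., 1.]) -- rounding is invisible to the gradient
```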
INT8 with bitsandbytes: The Easiest Path
bitsandbytes is the most widely used library for practical LLM quantization. Originally developed by Tim Dettmers, it supports INT8 and INT4 (NF4, FP4) and integrates natively with Hugging Face Transformers. The key advantage: no calibration dataset needed, quantization happens on-the-fly at model load time.
# pip install bitsandbytes transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# === INT8 CONFIGURATION ===
config_int8 = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0, # Outlier threshold (optimal default)
llm_int8_has_fp16_weight=False
)
# === INT4 NF4 CONFIGURATION (QLoRA style) ===
config_int4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4: optimal for normally-distributed weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute happens in BF16 (not INT4!)
bnb_4bit_use_double_quant=True, # Double quantization: saves ~0.4 bits/param
bnb_4bit_quant_storage=torch.uint8 # Storage format
)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
# Load with INT4 NF4
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=config_int4,
device_map="auto", # Automatically distributes across GPU/CPU
torch_dtype=torch.bfloat16,
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Check memory usage
mem_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory (INT4): {mem_gb:.2f} GB")
# Llama-3.1-8B: ~4.5 GB vs 16 GB in BF16
# Inference is identical to full-precision model
inputs = tokenizer("Explain model quantization briefly:", return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
bitsandbytes Limitations
- Quantization happens at load time: the quantized model is not saved automatically. Use GPTQ or AWQ to serialize quantized weights.
- bitsandbytes INT8 uses a mixed approach: weights at 8-bit, but activation outliers are handled in FP16 (LLM.int8() method).
- Computation always happens in BF16/FP16, not INT4: quantization reduces memory but does not accelerate compute as much as GPTQ/AWQ with optimized kernels.
- Performance on CPU-only systems without CUDA can be poor.
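The "~0.4 bits/param" saved by double quantization is simple bookkeeping on the quantization constants. A back-of-the-envelope sketch, using the block sizes reported in the QLoRA paper (64 weights per absmax constant, 256 absmax constants per second-level scale):

```python
# Bits/param spent on quantization constants, with and without double quantization
block = 64     # weights per first-level absmax block (QLoRA default)
block2 = 256   # absmax constants per second-level block (QLoRA default)

plain = 32 / block                          # one FP32 absmax per block: 0.5 bits/param
double = 8 / block + 32 / (block * block2)  # 8-bit absmax + FP32 second-level scale

print(f"plain:  {plain:.3f} bits/param")
print(f"double: {double:.3f} bits/param")
print(f"saved:  {plain - double:.3f} bits/param")  # saved: 0.373 bits/param
```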
GPTQ: Layer-by-Layer Quantization
GPTQ (Frantar et al. 2022) is an advanced PTQ algorithm that quantizes each layer independently by minimizing reconstruction error. It uses the Hessian matrix (approximated via calibration data) to determine which weights are most sensitive to quantization and how to compensate for residual error column by column.
# pip install auto-gptq optimum
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
model_name = "meta-llama/Llama-3.1-8B-Instruct"
output_dir = "./llama-3.1-8b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# === CALIBRATION DATASET ===
# GPTQ needs 128-512 calibration samples
# More representative = better accuracy retention
calibration_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_texts = [
text for text in calibration_data["text"]
if len(text.strip()) > 100
][:128]
calibration_tokens = [
tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
for text in calibration_texts
]
# === QUANTIZATION CONFIGURATION ===
quantize_config = BaseQuantizeConfig(
bits=4, # INT4 quantization
group_size=128, # Group size (smaller = more accurate, more memory)
damp_percent=0.01, # Damping factor for numerical stability
desc_act=True, # Activation ordering (improves quality)
sym=True, # Symmetric quantization
)
# === LOAD AND QUANTIZE ===
print("Loading model in FP16...")
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config,
torch_dtype="auto"
)
print("Starting GPTQ quantization (~30-60 min on A100)...")
model.quantize(
calibration_tokens,
use_triton=False, # Set True for optimized Triton inference kernels
batch_size=1,
cache_examples_on_gpu=True
)
# Save quantized model
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"GPTQ model saved to: {output_dir}")
# === LOAD PRE-QUANTIZED GPTQ MODEL ===
model_gptq = AutoGPTQForCausalLM.from_quantized(
output_dir,
use_triton=False,
device_map="auto",
    inject_fused_attention=True,  # Fused attention kernels (faster inference)
    inject_fused_mlp=True         # Fused MLP kernels
)
inputs = tokenizer("The GPTQ algorithm works by:", return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model_gptq.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
GPTQ produces serializable quantized models. Many models on Hugging Face Hub with the
-GPTQ or -4bit suffix use this algorithm. Quantization takes
time (typically 30-90 minutes for a 13B model on A100) but happens once: the quantized
model can be reused without re-quantizing.
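The error-compensation step at GPTQ's core can be sketched without the production machinery: quantize one column of W at a time, then spread that column's error onto the not-yet-quantized columns via the Cholesky factor of the inverse Hessian (H approximated as X^T X from calibration activations). The helper below is a hypothetical, educational simplification, with a single per-tensor scale instead of per-group scales, no activation ordering, and no lazy batch updates:

```python
import torch

def gptq_quantize_sketch(W: torch.Tensor, X: torch.Tensor,
                         n_bits: int = 4, damp: float = 0.01):
    """Educational GPTQ sketch: column-wise quantization with error compensation."""
    W = W.clone().float()
    d = W.shape[1]
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().max() / qmax                       # per-tensor scale (simplified)
    H = X.t().float() @ X.float()                      # Hessian proxy from calibration
    H += damp * H.diagonal().mean() * torch.eye(d)     # damping for stability
    U = torch.linalg.cholesky(torch.linalg.inv(H)).t() # upper factor of H^-1
    Q = torch.zeros_like(W)
    for j in range(d):
        q = torch.round(W[:, j] / scale).clamp(-qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # Push the error onto later columns so they can absorb it
        W[:, j + 1:] -= err.unsqueeze(1) * U[j, j + 1:].unsqueeze(0)
    return Q, scale

torch.manual_seed(0)
W = torch.randn(32, 64)
X = torch.randn(256, 64)
Q, s = gptq_quantize_sketch(W, X)
naive = torch.round(W / s).clamp(-8, 7) * s            # plain round-to-nearest
err_gptq = (X @ (W - Q).t()).pow(2).mean().item()
err_rtn = (X @ (W - naive).t()).pow(2).mean().item()
print(f"layer-output MSE  RTN: {err_rtn:.4f}  GPTQ-style: {err_gptq:.4f}")
```

On this synthetic layer the compensated version yields a noticeably lower layer-output error than naive rounding, which is exactly the quantity GPTQ's objective minimizes.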
AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al. 2023) starts from a different observation: not all weights are equally important. A small percentage (about 1%) corresponds to large-magnitude activations and contributes disproportionately to model predictions. Preserving these "salient weights" at higher precision dramatically reduces overall quantization error.
AWQ scales important weights before quantization, reducing error on critical channels. The result is quality comparable to or better than GPTQ, often with faster quantization and better performance on heterogeneous hardware (CPU, Mac M-series, mobile).
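The effect is easy to reproduce on synthetic data: scale up the weight channels that see large activations before quantizing, fold the inverse scale back afterwards, and grid-search the scaling exponent, as AWQ itself does on calibration data. A toy sketch (hypothetical helpers, not the AutoAWQ implementation, which uses group-wise quantization and per-layer search):

```python
import torch

def int4_rtn(w: torch.Tensor) -> torch.Tensor:
    """Round-to-nearest INT4 with one scale per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return torch.round(w / scale).clamp(-8, 7) * scale

torch.manual_seed(0)
X = torch.randn(512, 256)
X[:, :4] *= 50                        # a few salient input channels
W = torch.randn(128, 256)
ref = X @ W.t()                       # full-precision layer output

def output_err(alpha: float) -> float:
    """Quantize W after AWQ-style per-channel scaling s = act_max^alpha."""
    s = X.abs().amax(dim=0).clamp(min=1e-8) ** alpha
    w_q = int4_rtn(W * s) / s         # scale, quantize, fold the scale back
    return (ref - X @ w_q.t()).pow(2).mean().item()

err_plain = output_err(0.0)           # alpha=0 reduces to plain RTN
best_alpha, err_best = min(
    ((a, output_err(a)) for a in (0.0, 0.25, 0.5, 0.75, 1.0)),
    key=lambda t: t[1],
)
print(f"plain RTN MSE: {err_plain:.2f}")
print(f"best alpha={best_alpha}: MSE {err_best:.2f}")
```

Protecting the few salient channels cuts the layer-output error substantially, even though every weight still ends up in 4 bits.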
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
output_dir = "./llama-3.1-8b-awq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
print("Loading model for AWQ quantization...")
model = AutoAWQForCausalLM.from_pretrained(
model_name,
safetensors=True,
device_map="cuda",
trust_remote_code=True
)
# AWQ configuration
quant_config = {
"zero_point": True, # Asymmetric quantization (better for LLMs)
"q_group_size": 128, # Group size
"w_bit": 4, # 4-bit quantization
"version": "GEMM" # GEMM: balanced speed/quality
# GEMV: optimized for batch_size=1 (chatbot)
}
# Domain-specific calibration data (use representative examples)
calib_data = [
"Quantization enables running large models on consumer hardware.",
"Transformers revolutionized NLP with the attention mechanism.",
"LoRA fine-tuning reduces trainable parameters significantly.",
# Add 128-256 representative examples for your target domain
]
print("Starting AWQ quantization...")
model.quantize(
tokenizer,
quant_config=quant_config,
calib_data=calib_data
)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ model saved: {output_dir}")
# === LOAD AND RUN AWQ MODEL ===
model_awq = AutoAWQForCausalLM.from_quantized(
output_dir,
fuse_layers=True, # Fused kernel optimization
trust_remote_code=True,
safetensors=True
)
from transformers import pipeline
pipe = pipeline(
"text-generation",
model=model_awq,
tokenizer=tokenizer,
device_map="auto"
)
result = pipe("Explain AWQ quantization in one paragraph:", max_new_tokens=150)
print(result[0]["generated_text"])
GPTQ vs AWQ: Decision Guide
- GPTQ: best for NVIDIA GPU servers with CUDA. Fastest inference with Triton kernels. Industry standard for GPU deployment. Better for batch processing.
- AWQ: best for heterogeneous hardware (CPU, Mac, mobile). Faster to quantize. Preferred for chatbot apps (batch=1). Effective with GEMV kernel for single-token generation.
- Practical rule: GPTQ for dedicated GPU servers, AWQ for cross-platform and edge deployment.
GGUF and llama.cpp: Quantization for CPU and Edge
The GGUF format was created by the llama.cpp project to enable LLM inference on CPU, with optional GPU offloading via Metal (Apple), CUDA, or OpenCL. GGUF succeeds the older GGML format and fixes its forward-compatibility problems between versions.
GGUF naming follows the pattern Q[bits]_[variant]. The most common formats:
| Format | Avg bits | Quality | Recommended for |
|---|---|---|---|
| Q8_0 | 8.0 bits | Near-lossless | Maximum quality on powerful CPU |
| Q6_K | 6.6 bits | Excellent | Quality/size balance |
| Q5_K_M | 5.7 bits | Very good | Desktop with 16+ GB RAM |
| Q4_K_M | 4.8 bits | Good (95%) | Recommended default, 8+ GB laptop |
| Q3_K_M | 3.9 bits | Acceptable | Very memory-constrained hardware |
| Q2_K | 2.6 bits | Degraded | Extreme testing only |
The _K_M suffix indicates "K-quantization" at "Medium" size: a technique that uses block quantization with higher-precision scale factors for critical layers, yielding better quality than uniform quantization at the same bit width.
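The "avg bits" column exceeds the nominal bit width because of these per-block scale factors. The overhead is easy to estimate; the numbers below are illustrative (real K-quant layouts pack 6-bit sub-block scales inside super-blocks, and _M variants keep some tensors at higher precision, which is why Q4_K_M averages 4.8 rather than 4.5):

```python
def effective_bits(weight_bits: float, block_size: int, scale_bits: int) -> float:
    """Nominal weight bits plus the per-block scale overhead."""
    return weight_bits + scale_bits / block_size

# 4-bit weights, one FP16 scale per 32-weight block
print(effective_bits(4, 32, 16))  # 4.5
```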
# Convert and quantize with llama.cpp
# First: build llama.cpp
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && make -j4
# 1. Convert HuggingFace model -> GGUF FP16
#    (convert_hf_to_gguf.py expects a local checkpoint directory;
#     download it first, e.g. with huggingface-cli)
python convert_hf_to_gguf.py \
    ./Llama-3.1-8B-Instruct \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16
# 2. Quantize to different formats
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q8_0.gguf Q8_0
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q5_k_m.gguf Q5_K_M
# 3. Benchmark with llama.cpp
./llama-bench \
    -m llama-3.1-8b-q4_k_m.gguf \
    -p 512 \
    -n 128 \
    -t 8
# -p: prompt tokens, -n: tokens to generate, -t: CPU threads
# Typical output on M2 Pro (16 GB):
# Q4_K_M: prompt 45.2 t/s, generate 28.1 t/s
# Q8_0: prompt 24.1 t/s, generate 15.8 t/s
# === PYTHON USAGE via llama-cpp-python ===
# pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(
model_path="./llama-3.1-8b-q4_k_m.gguf",
n_ctx=4096, # Context window
n_threads=8, # CPU threads
n_gpu_layers=35, # Offload 35 layers to GPU (0 = CPU only)
verbose=False
)
response = llm(
"Q: How does model quantization work? A:",
max_tokens=256,
stop=["Q:", "\n\n"],
echo=True
)
print(response["choices"][0]["text"])
# === GGUF WITH OLLAMA (simpler approach) ===
modelfile_content = """
FROM ./llama-3.1-8b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a technical assistant specializing in deep learning."
"""
with open("Modelfile", "w") as f:
f.write(modelfile_content)
# ollama create my-llama -f Modelfile
# ollama run my-llama
Accuracy vs Speed vs Memory Benchmarks
Standard metrics for evaluating quantized model quality include perplexity on Wikitext-2, reasoning benchmarks (HellaSwag, MMLU), and domain-specific tasks.
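Perplexity itself is just the exponential of the mean per-token negative log-likelihood, so comparing quantized variants needs nothing beyond the model's logits. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: [n_tokens, vocab], labels: [n_tokens]. PPL = exp(mean NLL)."""
    nll = F.cross_entropy(logits, labels)
    return torch.exp(nll).item()

# Sanity check: uniform logits over a vocab of 100 -> perplexity ~100
logits = torch.zeros(64, 100)
labels = torch.randint(0, 100, (64,))
print(perplexity(logits, labels))  # ~100.0
```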
# Automated evaluation with lm-evaluation-harness
# pip install lm-eval
# BF16 baseline evaluation
lm_eval --model hf \
--model_args "pretrained=meta-llama/Llama-3.1-8B-Instruct" \
--tasks hellaswag,mmlu \
--batch_size 4 \
--output_path results_bf16/
# GPTQ INT4 evaluation
lm_eval --model hf \
  --model_args "pretrained=./llama-3.1-8b-gptq-int4" \
--tasks hellaswag,mmlu \
--batch_size 4 \
--output_path results_gptq_int4/
# === MEMORY AND SPEED BENCHMARK SCRIPT ===
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
def benchmark_model(model_name_or_path, quant_config=None, n_tokens=100, n_runs=5):
"""Complete benchmark: memory, latency, throughput."""
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
quantization_config=quant_config,
device_map="cuda",
torch_dtype=torch.bfloat16 if quant_config is None else None
)
mem_gb = model.get_memory_footprint() / 1024**3
prompt = "Explain the transformer architecture in detail:" * 3
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Warm-up
with torch.no_grad():
model.generate(**inputs, max_new_tokens=10)
# Benchmark
torch.cuda.synchronize()
latencies = []
for _ in range(n_runs):
start = time.perf_counter()
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=n_tokens)
torch.cuda.synchronize()
latencies.append(time.perf_counter() - start)
avg_latency = sum(latencies) / len(latencies)
throughput = n_tokens / avg_latency
return {
"memory_gb": round(mem_gb, 2),
"latency_sec": round(avg_latency, 3),
"throughput_tps": round(throughput, 1)
}
# Compare BF16 vs INT8 vs INT4
results = {}
results["BF16"] = benchmark_model("meta-llama/Llama-3.1-8B-Instruct")
config_8bit = BitsAndBytesConfig(load_in_8bit=True)
results["INT8"] = benchmark_model("meta-llama/Llama-3.1-8B-Instruct", config_8bit)
config_4bit = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True
)
results["INT4-NF4"] = benchmark_model("meta-llama/Llama-3.1-8B-Instruct", config_4bit)
for name, r in results.items():
print(f"{name:10} | Mem: {r['memory_gb']:5.2f} GB | "
f"Latency: {r['latency_sec']:.3f}s | "
f"Throughput: {r['throughput_tps']:.1f} t/s")
# Typical results for Llama-3.1-8B on RTX 3090:
# BF16 | Mem: 16.02 GB | Latency: 3.821s | Throughput: 26.2 t/s
# INT8 | Mem: 8.51 GB | Latency: 4.103s | Throughput: 24.4 t/s
# INT4-NF4 | Mem: 4.89 GB | Latency: 3.412s | Throughput: 29.3 t/s
Benchmark Summary (Llama-3.1-8B on RTX 4090)
| Method | Memory | Throughput | HellaSwag | Perplexity |
|---|---|---|---|---|
| BF16 (baseline) | 16.0 GB | 38 t/s | 82.1% | 6.14 |
| INT8 (bitsandbytes) | 8.5 GB | 35 t/s | 81.8% | 6.21 |
| INT4 NF4 (bnb) | 4.9 GB | 42 t/s | 81.2% | 6.47 |
| GPTQ INT4 | 4.8 GB | 55 t/s | 81.5% | 6.39 |
| AWQ INT4 | 4.7 GB | 52 t/s | 81.6% | 6.35 |
| Q4_K_M (GGUF, CPU) | 4.9 GB | 18 t/s | 81.3% | 6.42 |
Note: indicative values, vary with hardware, specific model, and batch size.
Quantization-Aware Training with PyTorch
For scenarios where PTQ is insufficient — typically models under 3B parameters or quantization below 4 bits — QAT allows recovering significant accuracy. PyTorch includes native QAT support from version 2.0, with static and dynamic INT8 support.
import torch
import torch.nn as nn
from torch.ao.quantization import (
prepare_qat_fx, convert_fx,
get_default_qat_qconfig_mapping
)
# === MODEL DEFINITION ===
class SimpleTransformerBlock(nn.Module):
def __init__(self, d_model=256, nhead=4, ff_dim=1024):
super().__init__()
self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
self.norm1 = nn.LayerNorm(d_model)
self.ff = nn.Sequential(
nn.Linear(d_model, ff_dim),
nn.ReLU(),
nn.Linear(ff_dim, d_model)
)
self.norm2 = nn.LayerNorm(d_model)
def forward(self, x):
attn_out, _ = self.attn(x, x, x)
x = self.norm1(x + attn_out)
return self.norm2(x + self.ff(x))
class SimpleModel(nn.Module):
def __init__(self, vocab_size=1000, d_model=256, n_layers=4):
super().__init__()
self.embed = nn.Embedding(vocab_size, d_model)
self.blocks = nn.Sequential(*[
SimpleTransformerBlock(d_model) for _ in range(n_layers)
])
self.head = nn.Linear(d_model, vocab_size)
def forward(self, x):
return self.head(self.blocks(self.embed(x)))
# === QAT SETUP ===
model = SimpleModel()
model.train()
# QConfig specifies how to quantize activations and weights
qconfig_mapping = get_default_qat_qconfig_mapping("x86")
# Example inputs for FX tracing (must be passed as a tuple)
example_inputs = (torch.randint(0, 1000, (2, 32)),)
# Prepare for QAT: inserts FakeQuantize nodes that simulate
# quantization during the forward pass
model_prepared = prepare_qat_fx(model, qconfig_mapping, example_inputs)
# === QAT TRAINING LOOP ===
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
def train_qat(model, n_epochs=10, freeze_quantizer_epoch=8):
for epoch in range(n_epochs):
# Synthetic data (replace with real dataset)
x = torch.randint(0, 1000, (32, 64))
y = torch.randint(0, 1000, (32, 64))
optimizer.zero_grad()
out = model(x)
loss = criterion(out.view(-1, 1000), y.view(-1))
loss.backward()
optimizer.step()
# Freeze quantizer after N epochs: stabilizes scale factors
if epoch == freeze_quantizer_epoch:
model.apply(torch.quantization.disable_observer)
print(f"Epoch {epoch}: FakeQuantize observers frozen")
if epoch % 2 == 0:
print(f"Epoch {epoch}/{n_epochs}, Loss: {loss.item():.4f}")
train_qat(model_prepared)
# === CONVERT TO REAL INT8 ===
model_prepared.eval()
model_int8 = convert_fx(model_prepared)
torch.save(model_int8.state_dict(), "model_qat_int8.pt")
print("QAT model saved as INT8 successfully")
Best Practices and Anti-Patterns
Quantization Best Practices
- Match method to deployment target: GPTQ INT4 for GPU servers; GGUF Q4_K_M for CPU/edge; NF4 with bitsandbytes for fine-tuning workflows.
- Use representative calibration data: for GPTQ and AWQ, use domain-specific samples, not generic text. 128-512 samples are enough but must be meaningful.
- Always benchmark on domain-specific tasks: Wikitext perplexity is a proxy metric that may miss quality degradation on code, math, or non-English languages.
- Optimal group size: group_size=128 is the safe default; 64 improves quality at higher memory cost; 256 saves memory but hurts quality.
- Enable double quantization: always set bnb_4bit_use_double_quant=True; it saves ~0.4 bits/param with minimal quality impact.
- Use BF16 as compute dtype: bnb_4bit_compute_dtype=torch.bfloat16 is more numerically stable than FP16 and supported on Ampere+ (RTX 3000+, A100).
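The group_size trade-off can be measured directly: a smaller group means each absmax scale covers fewer weights, so one outlier coarsens the grid for fewer neighbors. A quick sketch on random weights (illustrative helper, round-to-nearest only):

```python
import torch

def groupwise_int4_error(w: torch.Tensor, group_size: int) -> float:
    """Mean absolute error of INT4 round-to-nearest, one absmax scale per group."""
    g = w.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True) / 7
    deq = torch.round(g / scale).clamp(-8, 7) * scale
    return (g - deq).abs().mean().item()

torch.manual_seed(0)
w = torch.randn(1024, 1024)
for gs in (64, 128, 256):
    print(f"group_size={gs}: MAE {groupwise_int4_error(w, gs):.5f}")
# MAE grows with group size: a larger group's absmax sets a coarser step for all its weights
```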
Anti-Patterns to Avoid
- Never quantize critical layers uniformly: first and last layers (embedding, LM head) are more sensitive. GPTQ and AWQ exclude them automatically; replicate this behavior in custom implementations.
- Avoid FP4 for LLM inference: NF4 is superior to FP4 for LLM weights because they follow a normal distribution. NF4 is information-theoretically optimal for normally distributed values.
- Never compare methods without equal benchmarks: INT4 with GPTQ and INT4 with bitsandbytes have different effective quality. Always run the same evaluation suite before conclusions.
- Watch for dequantization overhead: bitsandbytes INT4 must dequantize during each forward pass. On memory-bandwidth-limited GPUs, INT8 may outperform INT4.
- Do not use Q2_K in production: quality degrades too much for most use cases. Q3_K_M is the minimum acceptable level for simple tasks.
SmoothQuant: Solving the Activation Outlier Problem
A fundamental challenge in LLM quantization is the presence of activation outliers: a small number of channels (typically 0.1-1% of hidden dimensions) have magnitudes 100-1000x larger than the rest. Quantizing these with the same scale as normal activations causes catastrophic precision loss for the majority of values. SmoothQuant (Xiao et al., 2022) elegantly solves this by migrating quantization difficulty from activations to weights, which are inherently easier to quantize.
The core idea: if activations have outlier channels, multiply them by a smoothing factor
s (reducing their magnitude) and divide the corresponding weight channels by the
same s (compensating mathematically). After smoothing, both activations and weights
fall within INT8 range. This is a mathematically equivalent transformation that enables true
W8A8 (weights INT8, activations INT8) quantization on hardware.
import torch
import torch.nn as nn
# ===================================================================
# SMOOTHQUANT: MIGRATING QUANTIZATION DIFFICULTY
# ===================================================================
# Key insight: Y = (X * diag(s)^-1) @ (diag(s) * W^T)
# = X_smooth @ W_smooth^T
# Where s is a per-channel smoothing factor that balances difficulty.
# After this transform, both X_smooth and W_smooth are easy to quantize.
def compute_smoothing_factors(
activation_stats: torch.Tensor, # Per-channel max: [hidden_dim]
weight: torch.Tensor, # [out_features, in_features]
    alpha: float = 0.5                  # Migration strength (higher alpha shifts more difficulty onto weights)
) -> torch.Tensor:
"""
Compute per-channel smoothing factors s.
alpha controls the migration balance:
- alpha=0.5: balanced (default, works for most models)
- alpha closer to 1.0: more migration to weights (helps when activations are very spiky)
- alpha closer to 0.0: more migration to activations (helps when weights are spiky)
Formula: s_j = max(|X_j|)^alpha / max(|W_j|)^(1-alpha)
"""
# Per-channel stats for weights: max absolute value per input feature
weight_max = weight.abs().max(dim=0).values # [in_features]
act_max = activation_stats.clamp(min=1e-8) # [in_features] - avoid division by zero
# Smoothing factor balances the quantization ranges
smooth_factor = (act_max ** alpha) / (weight_max ** (1 - alpha))
smooth_factor = smooth_factor.clamp(min=1e-8) # Numerical stability
return smooth_factor
class SmoothQuantLinear(nn.Module):
"""
Linear layer with SmoothQuant applied.
In practice, SmoothQuant is applied offline (pre-deployment) to
the pre-trained model weights. Here we show the conceptual approach.
"""
def __init__(
self,
in_features: int,
out_features: int,
smooth_factor: torch.Tensor = None
):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
if smooth_factor is not None:
self.register_buffer('smooth_factor', smooth_factor)
            # Apply smoothing to weights offline: W_smooth = W * diag(s)
            with torch.no_grad():
                self.linear.weight.data = (
                    self.linear.weight.data * smooth_factor.unsqueeze(0)
                )
else:
self.register_buffer('smooth_factor', torch.ones(in_features))
def forward(self, x: torch.Tensor) -> torch.Tensor:
# Apply inverse smoothing to activations: X_smooth = X / diag(s)
x_smooth = x / self.smooth_factor
# The weight is already smoothed offline
return self.linear(x_smooth)
# ===================================================================
# PRACTICAL SMOOTHQUANT APPLICATION
# ===================================================================
def apply_smoothquant(
model: nn.Module,
calibration_activations: dict, # {layer_name: activation_max_per_channel}
alpha: float = 0.5
) -> nn.Module:
"""
Apply SmoothQuant to all Linear layers in a model.
calibration_activations: collected during calibration forward pass.
"""
for name, module in model.named_modules():
if not isinstance(module, nn.Linear):
continue
if name not in calibration_activations:
continue
act_max = calibration_activations[name] # [in_features]
smooth_factor = compute_smoothing_factors(
act_max,
module.weight,
alpha=alpha
)
        # Apply smoothing to the weight (offline): W_smooth = W * diag(s)
        with torch.no_grad():
            module.weight.data = module.weight.data * smooth_factor.unsqueeze(0)
        # Register the factor; at inference the matching X / s division is usually
        # absorbed into the preceding LayerNorm's weight, making it zero-overhead
        module.register_buffer("smooth_factor", smooth_factor)
return model
# ===================================================================
# DEMONSTRATION: OUTLIER ANALYSIS
# ===================================================================
def analyze_activation_outliers(activations: torch.Tensor, threshold: float = 6.0):
"""
Analyze activation outliers (the problem SmoothQuant solves).
LLM.int8() uses threshold=6.0 as default.
"""
    flat = activations.reshape(-1, activations.shape[-1]).abs()  # [samples, D]
    D = flat.shape[-1]
    outlier_channels = (flat.max(dim=0).values > threshold).sum().item()
total_channels = D
outlier_ratio = outlier_channels / total_channels
print(f"Activation analysis:")
print(f" Max activation: {flat.max().item():.2f}")
print(f" Mean activation: {flat.mean().item():.4f}")
print(f" Outlier channels (>{threshold}): {outlier_channels}/{total_channels} ({outlier_ratio:.1%})")
print(f" Quantization range without smoothing: [{-flat.max().item():.2f}, {flat.max().item():.2f}]")
print(f" -> INT8 step size: {flat.max().item()*2/255:.4f} (very coarse for normal values!)")
return outlier_channels, outlier_ratio
# Simulate LLM activations (Gaussian + heavy-tail outliers in some channels)
torch.manual_seed(42)
d_model = 4096
n_samples = 128
activations = torch.randn(n_samples, d_model)
# Add outliers to ~0.5% of channels
outlier_channels_idx = torch.randperm(d_model)[:int(d_model * 0.005)]
activations[:, outlier_channels_idx] *= 150 # 150x larger than normal
analyze_activation_outliers(activations, threshold=6.0)
# Output:
# Max activation: ~212.5
# Mean activation: 0.4821
# Outlier channels (>6.0): ~20/4096 (0.5%)
# INT8 step size: 1.667 (horrible precision for values < 5!)
# After SmoothQuant: max ~5.2, INT8 step ~0.041 (40x better!)
Modern Ultra-Low Precision: INT2 and Extreme Quantization
The frontier of quantization has pushed beyond INT4 toward extreme compression. In 2024-2025, several methods achieved INT2-INT3 quantization with acceptable quality for many use cases. QuIP# (Tseng et al., 2024) uses randomized Hadamard transforms to incoherize weight matrices before quantization, enabling true 2-bit quantization. AQLM (Additive Quantization of Language Models) uses vector quantization codebooks. These methods target a new edge deployment tier: 70B models in under 20 GB RAM.
import torch
import torch.nn as nn
import math
# ===================================================================
# HADAMARD RANDOMIZED QUANTIZATION (QuIP# principle)
# ===================================================================
# Core idea: applying a random orthogonal transformation to weights
# makes the quantization error more uniform (incoherent).
# Incoherence = no single dimension dominates the error.
# This allows 2-bit quantization with much less quality loss.
def hadamard_transform(x: torch.Tensor, normalize: bool = True) -> torch.Tensor:
"""
Applies the Walsh-Hadamard transform to the last dimension.
H_n is a 2^n x 2^n matrix with entries +1/-1, orthogonal.
Transform: y = H @ x / sqrt(n)
"""
d = x.shape[-1]
# d must be a power of 2
assert d > 0 and (d & (d - 1)) == 0, "Dimension must be power of 2"
x = x.clone()
n = d
h = 1
while h < n:
for i in range(0, n, h * 2):
for j in range(i, i + h):
a = x[..., j].clone()
b = x[..., j + h].clone()
x[..., j] = a + b
x[..., j + h] = a - b
h *= 2
if normalize:
x = x / math.sqrt(d)
return x
```python
class QuIPQuantizer:
    """
    QuIP#-style quantizer: randomized incoherence preprocessing + INT2.
    Simplified version for educational purposes.
    Production QuIP# uses optimized CUDA kernels for the Hadamard transform.
    """

    def __init__(self, n_bits: int = 2, group_size: int = 256):
        self.n_bits = n_bits
        self.group_size = group_size
        self.n_levels = 2 ** n_bits  # 4 levels for INT2, 16 for INT4

    def quantize_weight(self, W: torch.Tensor, seed: int = 42) -> tuple:
        """
        QuIP#-style quantization:
        1. Apply a random orthogonal transform (Hadamard)
        2. Quantize in the transformed space
        3. Store the transform seed (not the full matrix)
        """
        out_features, in_features = W.shape
        # Pad to a power of 2 if needed (the butterfly requires it)
        pad_to = 2 ** math.ceil(math.log2(in_features))
        if pad_to > in_features:
            W_padded = torch.zeros(out_features, pad_to, dtype=W.dtype, device=W.device)
            W_padded[:, :in_features] = W
        else:
            W_padded = W
        # Randomize column order (simulates a random rotation)
        torch.manual_seed(seed)
        perm = torch.randperm(pad_to)
        W_perm = W_padded[:, perm]
        # Apply the Hadamard transform per row
        W_transformed = hadamard_transform(W_perm, normalize=True)
        # Quantize ALL pad_to transformed columns: the transform mixes
        # every dimension, so padding can only be dropped after the
        # inverse transform at dequantization time
        W_q, scale, _ = self._uniform_quantize(W_transformed)
        return W_q, scale, perm, seed

    def _uniform_quantize(self, x: torch.Tensor) -> tuple:
        """Group-wise symmetric uniform quantization."""
        B, D = x.shape
        n_groups = max(1, D // self.group_size)
        x_grouped = x[:, :n_groups * self.group_size].reshape(B * n_groups, self.group_size)
        scale = x_grouped.abs().max(dim=-1, keepdim=True).values / ((self.n_levels // 2) - 1)
        scale = scale.clamp(min=1e-8)
        x_q = torch.round(x_grouped / scale).clamp(-(self.n_levels // 2), self.n_levels // 2 - 1)
        x_q = x_q.reshape(B, n_groups * self.group_size)
        scale = scale.reshape(B, n_groups)
        return x_q.to(torch.int8), scale, n_groups
```
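To get intuition for what group-wise symmetric quantization costs at different bit-widths, here is a standalone round-trip on hypothetical random weights (the function re-implements the group-wise scheme above so the snippet runs on its own):

```python
import torch

def quantize_groupwise(x: torch.Tensor, n_bits: int = 2,
                       group_size: int = 256) -> torch.Tensor:
    """Quantize and immediately dequantize, group-wise and symmetric."""
    n_levels = 2 ** n_bits
    B, D = x.shape
    g = x.reshape(B * (D // group_size), group_size)
    scale = g.abs().max(dim=-1, keepdim=True).values / (n_levels // 2 - 1)
    scale = scale.clamp(min=1e-8)
    q = torch.round(g / scale).clamp(-(n_levels // 2), n_levels // 2 - 1)
    return (q * scale).reshape(B, D)

torch.manual_seed(0)
W = torch.randn(64, 512)
for bits in (2, 3, 4, 8):
    rel_err = (W - quantize_groupwise(W, n_bits=bits)).norm() / W.norm()
    print(f"{bits}-bit group-wise: relative error {rel_err:.3f}")
```

The error drops steeply from 2 to 4 bits, which is why naive scalar INT2 is rarely usable without the incoherence tricks described above.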
```python
# ===================================================================
# GGUF ULTRA-LOW PRECISION BENCHMARKS
# ===================================================================
gguf_ultra_low = [
    {"format": "Q2_K",   "bits": 2.6, "llama8b_gb": 2.9, "ppl_loss": "+4.2", "usable": False},
    {"format": "Q3_K_S", "bits": 3.0, "llama8b_gb": 3.4, "ppl_loss": "+2.1", "usable": True},
    {"format": "Q3_K_M", "bits": 3.9, "llama8b_gb": 3.9, "ppl_loss": "+1.4", "usable": True},
    {"format": "Q4_K_M", "bits": 4.8, "llama8b_gb": 4.9, "ppl_loss": "+0.6", "usable": True},
    {"format": "Q5_K_M", "bits": 5.7, "llama8b_gb": 5.7, "ppl_loss": "+0.3", "usable": True},
    {"format": "Q6_K",   "bits": 6.6, "llama8b_gb": 6.6, "ppl_loss": "+0.1", "usable": True},
    {"format": "Q8_0",   "bits": 8.0, "llama8b_gb": 8.1, "ppl_loss": "~0.0", "usable": True},
]

print(f"\n{'Format':<12} {'Avg bits':>10} {'Size (8B)':>12} {'PPL loss':>10} {'Usable?':>10}")
print("-" * 60)
for fmt in gguf_ultra_low:
    print(f"{fmt['format']:<12} {fmt['bits']:>10.1f} {str(fmt['llama8b_gb']) + 'GB':>12} "
          f"{fmt['ppl_loss']:>10} {str(fmt['usable']):>10}")

print("\nRecommendation:")
print("  Minimum for production: Q3_K_M (3.9 bits, +1.4 perplexity)")
print("  Sweet spot: Q4_K_M (4.8 bits, +0.6 perplexity)")
print("  High quality: Q5_K_M or Q6_K")
print("  Q2_K is only for extreme RAM constraints (testing only)")
```
```python
# ===================================================================
# AQLM: ADDITIVE QUANTIZATION (2-bit via codebook)
# ===================================================================
# AQLM represents weight sub-vectors using a learned codebook.
# Instead of scalar quantization (each weight -> int), it uses
# vector quantization: a group of d weights -> one codebook index.
# This achieves effective 2-bit with much better quality than
# scalar INT2, because the codebook captures weight correlations.
#
# AQLM codebook structure:
#   - C codebooks, each with K entries of size d
#   - A weight vector w of size d is encoded as: w ≈ sum_c(C_c[I_c])
#   - For d=8, K=256, C=2: effective 2 bits/param (2 * 8 bits / 8 weights = 2 bpw)
#
# Loading an AQLM-quantized model (example):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "ISTA-DASLab/Llama-3-8B-AQLM-2Bit-1x16",
#       torch_dtype=torch.float16, device_map="cuda"
#   )
# Memory: Llama-3-8B at 2-bit = ~2.5GB vs 16GB in BF16

print("\nAQLM 2-bit: Llama-3-8B at ~2.5GB | Quality: similar to scalar INT4")
```
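The bits-per-parameter arithmetic can be made concrete with a toy vector quantizer. This is a hypothetical sketch with a random, untrained codebook, not AQLM itself (which learns its codebooks end-to-end and sums C of them additively); it only illustrates the encoding cost of one codebook:

```python
import math
import torch

# One codebook: each group of d=8 weights is replaced by the nearest
# of K=256 entries, costing log2(256) = 8 bits per group, i.e.
# 8 bits / 8 weights = 1 bit/param. Two codebooks -> the 2 bpw above.
torch.manual_seed(0)
d, K = 8, 256
codebook = torch.randn(K, d) * 0.02     # stand-in for a learned codebook
W = torch.randn(4096, d) * 0.02         # 4096 groups of 8 weights

idx = torch.cdist(W, codebook).argmin(dim=-1)   # nearest codeword per group
W_hat = codebook[idx]

bpw = math.log2(K) / d
rel_err = (W - W_hat).norm() / W.norm()
print(f"bits/param (one codebook): {bpw:.0f}")
print(f"relative error with a random codebook: {rel_err:.2f}")
```

With a trained codebook the error drops far below this random baseline, which is exactly what AQLM's end-to-end optimization buys.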
## Deployment Scenario Guide
| Scenario | Hardware | Recommended method | Format |
|---|---|---|---|
| GPU production server | A100/H100 80GB | GPTQ INT4 or AWQ INT4 | Safetensors GPTQ |
| Consumer workstation | RTX 4090 24GB | GPTQ INT4 (up to ~30B models) | Safetensors GPTQ |
| Windows/Linux laptop | 8-16GB GPU VRAM | bitsandbytes NF4 or AWQ | HuggingFace Hub |
| Apple M-series laptop | 16-96GB Unified Mem | GGUF Q4_K_M or Q5_K_M | GGUF + llama.cpp/Ollama |
| Raspberry Pi 5 | 8GB RAM | GGUF Q3_K_M or Q4_K_M (1-3B) | GGUF + llama.cpp |
| NVIDIA Jetson Orin | 16GB unified mem | GPTQ INT4 or GGUF Q4_K_M | GPTQ or GGUF |
| Fine-tuning on limited GPU | RTX 3090 24GB | QLoRA (NF4 + LoRA) | bitsandbytes NF4 |
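The table above can be collapsed into a small lookup helper. The platform tags and VRAM threshold here are illustrative choices, not official guidance:

```python
def recommend_quantization(platform: str, mem_gb: float) -> str:
    """Rough encoding of the deployment scenario table (illustrative)."""
    if platform in ("apple", "cpu", "raspberry-pi"):
        return "GGUF Q4_K_M or Q5_K_M via llama.cpp / Ollama"
    if platform == "finetune":
        return "QLoRA: bitsandbytes NF4 + LoRA adapters"
    # CUDA GPUs: choose by available VRAM
    if mem_gb >= 24:
        return "GPTQ INT4 or AWQ INT4"
    return "bitsandbytes NF4 or AWQ"

print(recommend_quantization("apple", 32))  # GGUF Q4_K_M or Q5_K_M via llama.cpp / Ollama
print(recommend_quantization("cuda", 12))   # bitsandbytes NF4 or AWQ
```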
## Conclusion
Model quantization has evolved from a memory-saving niche technique to a foundational tool for anyone working with LLMs. The 2026 landscape is rich: bitsandbytes for rapid prototyping and QLoRA fine-tuning, GPTQ for optimized GPU deployment, AWQ for heterogeneous and cross-platform hardware, and GGUF for CPU and edge devices.
The key insight is matching the method to the context. INT4 with GPTQ on an RTX 4090 is often faster than BF16 thanks to optimized kernels. GGUF Q4_K_M on an M3 MacBook Pro runs Llama-3.1-8B at 28 tokens/sec without a dedicated GPU. These are not compromises — they are new deployment paradigms that enable use cases previously impossible.
The natural next step is combining quantization with knowledge distillation, covered in our next article: how to transfer knowledge from a large quantized model to a smaller one, getting the best of both compression techniques.
## Next Steps
- Next article: Knowledge Distillation: Compress Models Efficiently
- Related: Fine-Tuning with LoRA and QLoRA
- See also: Running Local LLMs with Ollama
- MLOps series: Model Serving and Deployment