Model Quantization: INT8, INT4, GPTQ, AWQ and Beyond
Frontier models occupy hundreds of gigabytes at full precision. Llama-3 70B in FP16 requires 140 GB of VRAM, far beyond any consumer hardware without quantization. With INT4 quantization, the same Llama-3 70B drops to 35 GB, fitting in two RTX 4090s or a system with 64 GB of RAM. The accuracy loss? Often less than 1%.
Model quantization has evolved from a memory-saving trick into a foundational technique for the modern LLM ecosystem. Algorithms like GPTQ, AWQ, SmoothQuant and the GGUF format from llama.cpp have democratized access to large language models, making them deployable on consumer hardware, edge devices, and even Raspberry Pis.
This guide covers quantization end-to-end: from the underlying math to choosing the right method for your specific use case, with working code examples for each technique.
What You Will Learn
- Why quantization is essential for modern AI deployment
- PTQ vs QAT: when each approach makes sense
- INT8 quantization with bitsandbytes and SmoothQuant
- INT4 quantization with NF4, FP4, and QLoRA
- How the GPTQ algorithm works internally
- AWQ: activation-aware weight quantization advantages
- GGUF format and llama.cpp quantization levels
- Accuracy vs speed vs memory benchmarks
- Quantization-Aware Training with PyTorch
- Best practices and deployment decision guide
The VRAM Problem: Why Quantization Matters
A single parameter in FP32 takes 4 bytes. In FP16/BF16 it takes 2 bytes. With INT8 quantization, 1 byte; with INT4, only 0.5 bytes. The table below shows memory requirements for Llama-3 70B across precisions:
| Precision | Bytes/param | Memory (70B) | Hardware needed |
|---|---|---|---|
| FP32 | 4 bytes | 280 GB | Impossible on consumer |
| BF16 / FP16 | 2 bytes | 140 GB | 2x A100 80GB |
| INT8 | 1 byte | 70 GB | 1x A100 80GB |
| INT4 / NF4 | 0.5 bytes | 35 GB | 2x RTX 4090 (24GB) |
| INT3 | 0.375 bytes | ~26 GB | RTX 3090 + RAM offload |
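The memory column is just parameter count times bytes per parameter (decimal GB, matching the table; a running model additionally needs room for activations and the KV cache, ignored here):

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight memory only: parameters * bytes/param, in decimal GB."""
    return n_params * bytes_per_param / 1e9

n = 70e9  # Llama-3 70B
for name, bpp in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {model_memory_gb(n, bpp):.0f} GB")
# FP32: 280 GB, BF16/FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```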
Beyond VRAM, quantization delivers benefits in throughput (tokens/sec), inference latency, and cloud costs. On Apple M-series chips, Raspberry Pi, or NVIDIA Jetson, quantization is the only way to run models beyond 1B parameters.
Market Data 2026
Gartner forecasts that by 2027, quantized Small Language Models (SLMs) will surpass cloud LLMs in deployment frequency by a factor of 3x. On-device AI reduces operational costs by 70% compared to cloud APIs, eliminating network latency and per-token billing. Quantization has become mission-critical for edge AI, mobile deployment, and embedded systems.
Quantization Math: From Float to Integer
Quantization maps a continuous floating-point value to a discrete integer. The process involves two operations: quantization (float to int) and dequantization (int back to float for computation).
# Uniform INT8 quantization (fundamental scheme)
# W_quantized = round(W / scale) + zero_point
# W_dequantized = (W_quantized - zero_point) * scale
import torch
def quantize_tensor_int8(tensor: torch.Tensor, symmetric: bool = True):
    """
    Uniform INT8 quantization.
    symmetric=True: zero_point=0, range [-127, 127], stored as int8
    symmetric=False: asymmetric range [0, 255] with zero_point offset, stored as uint8
    """
    if symmetric:
        max_val = tensor.abs().max().item()
        scale = max(max_val, 1e-8) / 127.0
        zero_point = 0
        qmin, qmax, dtype = -127, 127, torch.int8
    else:
        min_val = tensor.min().item()
        max_val = tensor.max().item()
        scale = max(max_val - min_val, 1e-8) / 255.0
        zero_point = round(-min_val / scale)
        zero_point = max(0, min(255, zero_point))
        qmin, qmax, dtype = 0, 255, torch.uint8
    # Quantize: FP16 -> integer grid (clamp to the range the storage dtype allows)
    quantized = torch.clamp(
        torch.round(tensor / scale) + zero_point,
        qmin, qmax
    ).to(dtype)
    # Dequantize: integer -> FP32 (for error measurement)
    dequantized = (quantized.float() - zero_point) * scale
    error = (tensor.float() - dequantized).abs().mean().item()
    return quantized, scale, zero_point, error
# Practical example
W = torch.randn(1024, 1024, dtype=torch.float16)
q, scale, zp, err = quantize_tensor_int8(W, symmetric=True)
print(f"Original: {W.dtype}, {W.element_size() * W.numel() / 1024:.1f} KB")
print(f"Quantized: {q.dtype}, {q.element_size() * q.numel() / 1024:.1f} KB")
print(f"Memory reduction: {(1 - q.element_size()/W.element_size()) * 100:.0f}%")
print(f"Mean absolute error: {err:.6f}")
# Typical output:
# Original: torch.float16, 2048.0 KB
# Quantized: torch.int8, 1024.0 KB
# Memory reduction: 50%
# Mean absolute error: ~0.0096 (roughly scale/4 for Gaussian weights)
PTQ vs QAT: Choosing the Right Paradigm
Two fundamental paradigms exist for model quantization:
- PTQ (Post-Training Quantization): applied after training to an already trained model. Requires only a small calibration dataset. Fast and practical, but may degrade accuracy on small models or very low precision (INT2, INT3).
- QAT (Quantization-Aware Training): simulates quantization during training, allowing the model to adapt its weights to precision loss. Produces better results but requires compute resources comparable to full fine-tuning.
PTQ vs QAT: Practical Comparison
| Aspect | PTQ | QAT |
|---|---|---|
| Time required | Minutes to hours | Hours to days |
| Calibration data | Small (512-2048 samples) | Full training dataset |
| VRAM for quantization | Low (forward pass only) | High (full backward pass) |
| INT8 accuracy | Excellent (<0.5% loss) | Excellent |
| INT4 accuracy | Good (1-3% loss) | Very good (<1% loss) |
| Best for | Large models, fast production | Small models, max quality |
For modern LLMs (7B+ parameters), PTQ is generally sufficient: parameter redundancy means INT4 quantization preserves most capabilities. For models under 3B parameters, QAT is recommended when accuracy is critical. Recent PyTorch experiments show QAT can recover up to 96% of accuracy degradation on HellaSwag compared to PTQ for Llama-3.
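The mechanism QAT relies on fits in a few lines: a fake-quantize operation rounds values in the forward pass but lets gradients flow through unchanged (the straight-through estimator), so training sees quantization error while optimization proceeds as usual. A minimal sketch (illustrative helper, not PyTorch's FakeQuantize module, which adds observers and learned scales):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Round to an INT8 grid in the forward pass; straight-through gradient.
    x + (q - x).detach() evaluates to q, but its derivative w.r.t. x is 1."""
    q = torch.clamp(torch.round(x / scale), -128, 127) * scale
    return x + (q - x).detach()

w = torch.randn(4, requires_grad=True)
loss = fake_quantize(w, scale=0.1).sum()
loss.backward()
print(w.grad)  # tensor([1., 1., 1., 1.]) -- rounding is invisible to the gradient
```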
INT8 with bitsandbytes: The Easiest Path
bitsandbytes is the most widely used library for practical LLM quantization. Originally developed by Tim Dettmers, it supports INT8 and INT4 (NF4, FP4) and integrates natively with Hugging Face Transformers. The key advantage: no calibration dataset needed, quantization happens on-the-fly at model load time.
# pip install bitsandbytes transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# === INT8 CONFIGURATION ===
config_int8 = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0, # Outlier threshold (optimal default)
llm_int8_has_fp16_weight=False
)
# === INT4 NF4 CONFIGURATION (QLoRA style) ===
config_int4 = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4: optimal for normally-distributed weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute happens in BF16 (not INT4!)
bnb_4bit_use_double_quant=True, # Double quantization: saves ~0.4 bits/param
bnb_4bit_quant_storage=torch.uint8 # Storage format
)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
# Load with INT4 NF4
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=config_int4,
device_map="auto", # Automatically distributes across GPU/CPU
torch_dtype=torch.bfloat16,
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Check memory usage
mem_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory (INT4): {mem_gb:.2f} GB")
# Llama-3.1-8B: ~4.5 GB vs 16 GB in BF16
# Inference is identical to full-precision model
inputs = tokenizer("Explain model quantization briefly:", return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
bitsandbytes Limitations
- Quantization happens at load time: the quantized model is not saved automatically. Use GPTQ or AWQ to serialize quantized weights.
- bitsandbytes INT8 uses a mixed approach: weights at 8-bit, but activation outliers are handled in FP16 (LLM.int8() method).
- Computation always happens in BF16/FP16, not INT4: quantization reduces memory but does not accelerate compute as much as GPTQ/AWQ with optimized kernels.
- Performance on CPU-only systems without CUDA can be poor.
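The "~0.4 bits/param" saved by double quantization is simple bookkeeping on the quantization constants. A back-of-the-envelope sketch, using the block sizes reported in the QLoRA paper (64 weights per absmax constant, 256 absmax constants per second-level scale):

```python
# Bits/param spent on quantization constants, with and without double quantization
block = 64     # weights per first-level absmax block (QLoRA default)
block2 = 256   # absmax constants per second-level block (QLoRA default)

plain = 32 / block                          # one FP32 absmax per block: 0.5 bits/param
double = 8 / block + 32 / (block * block2)  # 8-bit absmax + FP32 second-level scale

print(f"plain:  {plain:.3f} bits/param")
print(f"double: {double:.3f} bits/param")
print(f"saved:  {plain - double:.3f} bits/param")  # saved: 0.373 bits/param
```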
GPTQ: Layer-by-Layer Quantization
GPTQ (Frantar et al. 2022) is an advanced PTQ algorithm that quantizes each layer independently by minimizing reconstruction error. It uses the Hessian matrix (approximated via calibration data) to determine which weights are most sensitive to quantization and how to compensate for residual error column by column.
# pip install auto-gptq optimum
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
model_name = "meta-llama/Llama-3.1-8B-Instruct"
output_dir = "./llama-3.1-8b-gptq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# === CALIBRATION DATASET ===
# GPTQ needs 128-512 calibration samples
# More representative = better accuracy retention
calibration_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
calibration_texts = [
text for text in calibration_data["text"]
if len(text.strip()) > 100
][:128]
calibration_tokens = [
tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
for text in calibration_texts
]
# === QUANTIZATION CONFIGURATION ===
quantize_config = BaseQuantizeConfig(
bits=4, # INT4 quantization
group_size=128, # Group size (smaller = more accurate, more memory)
damp_percent=0.01, # Damping factor for numerical stability
desc_act=True, # Activation ordering (improves quality)
sym=True, # Symmetric quantization
)
# === LOAD AND QUANTIZE ===
print("Loading model in FP16...")
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config,
torch_dtype="auto"
)
print("Starting GPTQ quantization (~30-60 min on A100)...")
model.quantize(
calibration_tokens,
use_triton=False, # Set True for optimized Triton inference kernels
batch_size=1,
cache_examples_on_gpu=True
)
# Save quantized model
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"GPTQ model saved to: {output_dir}")
# === LOAD PRE-QUANTIZED GPTQ MODEL ===
model_gptq = AutoGPTQForCausalLM.from_quantized(
output_dir,
use_triton=False,
device_map="auto",
    inject_fused_attention=True,  # Fused attention kernels (faster inference)
    inject_fused_mlp=True         # Fused MLP kernels
)
inputs = tokenizer("The GPTQ algorithm works by:", return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model_gptq.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
GPTQ produces serializable quantized models. Many models on Hugging Face Hub with the
-GPTQ or -4bit suffix use this algorithm. Quantization takes
time (typically 30-90 minutes for a 13B model on A100) but happens once: the quantized
model can be reused without re-quantizing.
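The error-compensation step at GPTQ's core can be sketched without the production machinery: quantize one column of W at a time, then spread that column's error onto the not-yet-quantized columns via the Cholesky factor of the inverse Hessian (H approximated as X^T X from calibration activations). The helper below is a hypothetical, educational simplification, with a single per-tensor scale instead of per-group scales, no activation ordering, and no lazy batch updates:

```python
import torch

def gptq_quantize_sketch(W: torch.Tensor, X: torch.Tensor,
                         n_bits: int = 4, damp: float = 0.01):
    """Educational GPTQ sketch: column-wise quantization with error compensation."""
    W = W.clone().float()
    d = W.shape[1]
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().max() / qmax                       # per-tensor scale (simplified)
    H = X.t().float() @ X.float()                      # Hessian proxy from calibration
    H += damp * H.diagonal().mean() * torch.eye(d)     # damping for stability
    U = torch.linalg.cholesky(torch.linalg.inv(H)).t() # upper factor of H^-1
    Q = torch.zeros_like(W)
    for j in range(d):
        q = torch.round(W[:, j] / scale).clamp(-qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # Push the error onto later columns so they can absorb it
        W[:, j + 1:] -= err.unsqueeze(1) * U[j, j + 1:].unsqueeze(0)
    return Q, scale

torch.manual_seed(0)
W = torch.randn(32, 64)
X = torch.randn(256, 64)
Q, s = gptq_quantize_sketch(W, X)
naive = torch.round(W / s).clamp(-8, 7) * s            # plain round-to-nearest
err_gptq = (X @ (W - Q).t()).pow(2).mean().item()
err_rtn = (X @ (W - naive).t()).pow(2).mean().item()
print(f"layer-output MSE  RTN: {err_rtn:.4f}  GPTQ-style: {err_gptq:.4f}")
```

On this synthetic layer the compensated version yields a noticeably lower layer-output error than naive rounding, which is exactly the quantity GPTQ's objective minimizes.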
AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al. 2023) starts from a different observation: not all weights are equally important. A small percentage (about 1%) corresponds to large-magnitude activations and contributes disproportionately to model predictions. Preserving these "salient weights" at higher precision dramatically reduces overall quantization error.
AWQ scales important weights before quantization, reducing error on critical channels. The result is quality comparable to or better than GPTQ, often with faster quantization and better performance on heterogeneous hardware (CPU, Mac M-series, mobile).
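The effect is easy to reproduce on synthetic data: scale up the weight channels that see large activations before quantizing, fold the inverse scale back afterwards, and grid-search the scaling exponent, as AWQ itself does on calibration data. A toy sketch (hypothetical helpers, not the AutoAWQ implementation, which uses group-wise quantization and per-layer search):

```python
import torch

def int4_rtn(w: torch.Tensor) -> torch.Tensor:
    """Round-to-nearest INT4 with one scale per output channel."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return torch.round(w / scale).clamp(-8, 7) * scale

torch.manual_seed(0)
X = torch.randn(512, 256)
X[:, :4] *= 50                        # a few salient input channels
W = torch.randn(128, 256)
ref = X @ W.t()                       # full-precision layer output

def output_err(alpha: float) -> float:
    """Quantize W after AWQ-style per-channel scaling s = act_max^alpha."""
    s = X.abs().amax(dim=0).clamp(min=1e-8) ** alpha
    w_q = int4_rtn(W * s) / s         # scale, quantize, fold the scale back
    return (ref - X @ w_q.t()).pow(2).mean().item()

err_plain = output_err(0.0)           # alpha=0 reduces to plain RTN
best_alpha, err_best = min(
    ((a, output_err(a)) for a in (0.0, 0.25, 0.5, 0.75, 1.0)),
    key=lambda t: t[1],
)
print(f"plain RTN MSE: {err_plain:.2f}")
print(f"best alpha={best_alpha}: MSE {err_best:.2f}")
```

Protecting the few salient channels cuts the layer-output error substantially, even though every weight still ends up in 4 bits.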
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
output_dir = "./llama-3.1-8b-awq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
print("Loading model for AWQ quantization...")
model = AutoAWQForCausalLM.from_pretrained(
model_name,
safetensors=True,
device_map="cuda",
trust_remote_code=True
)
# AWQ configuration
quant_config = {
"zero_point": True, # Asymmetric quantization (better for LLMs)
"q_group_size": 128, # Group size
"w_bit": 4, # 4-bit quantization
"version": "GEMM" # GEMM: balanced speed/quality
# GEMV: optimized for batch_size=1 (chatbot)
}
# Domain-specific calibration data (use representative examples)
calib_data = [
"Quantization enables running large models on consumer hardware.",
"Transformers revolutionized NLP with the attention mechanism.",
"LoRA fine-tuning reduces trainable parameters significantly.",
# Add 128-256 representative examples for your target domain
]
print("Starting AWQ quantization...")
model.quantize(
tokenizer,
quant_config=quant_config,
calib_data=calib_data
)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"AWQ model saved: {output_dir}")
# === LOAD AND RUN AWQ MODEL ===
model_awq = AutoAWQForCausalLM.from_quantized(
output_dir,
fuse_layers=True, # Fused kernel optimization
trust_remote_code=True,
safetensors=True
)
from transformers import pipeline
pipe = pipeline(
"text-generation",
model=model_awq,
tokenizer=tokenizer,
device_map="auto"
)
result = pipe("Explain AWQ quantization in one paragraph:", max_new_tokens=150)
print(result[0]["generated_text"])
GPTQ vs AWQ: Decision Guide
- GPTQ: best for NVIDIA GPU servers with CUDA. Fastest inference with Triton kernels. Industry standard for GPU deployment. Better for batch processing.
- AWQ: best for heterogeneous hardware (CPU, Mac, mobile). Faster to quantize. Preferred for chatbot apps (batch=1). Effective with GEMV kernel for single-token generation.
- Practical rule: GPTQ for dedicated GPU servers, AWQ for cross-platform and edge deployment.
GGUF and llama.cpp: Quantization for CPU and Edge
The GGUF format was created by the llama.cpp project to enable LLM inference on CPU, with optional GPU offloading via Metal (Apple), CUDA, or OpenCL. GGUF succeeds the older GGML format and fixes its forward-compatibility problems between versions.
GGUF naming follows the pattern Q[bits]_[variant]. The most common formats:
| Format | Avg bits | Quality | Recommended for |
|---|---|---|---|
| Q8_0 | 8.0 bits | Near-lossless | Maximum quality on powerful CPU |
| Q6_K | 6.6 bits | Excellent | Quality/size balance |
| Q5_K_M | 5.7 bits | Very good | Desktop with 16+ GB RAM |
| Q4_K_M | 4.8 bits | Good (95%) | Recommended default, 8+ GB laptop |
| Q3_K_M | 3.9 bits | Acceptable | Very memory-constrained hardware |
| Q2_K | 2.6 bits | Degraded | Extreme testing only |
The _K_M suffix indicates "K-quantization" at "Medium" size: a technique that uses block quantization with higher-precision scale factors for critical layers, yielding better quality than uniform quantization at the same bit width.
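The "avg bits" column exceeds the nominal bit width because of these per-block scale factors. The overhead is easy to estimate; the numbers below are illustrative (real K-quant layouts pack 6-bit sub-block scales inside super-blocks, and _M variants keep some tensors at higher precision, which is why Q4_K_M averages 4.8 rather than 4.5):

```python
def effective_bits(weight_bits: float, block_size: int, scale_bits: int) -> float:
    """Nominal weight bits plus the per-block scale overhead."""
    return weight_bits + scale_bits / block_size

# 4-bit weights, one FP16 scale per 32-weight block
print(effective_bits(4, 32, 16))  # 4.5
```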
# Convert and quantize with llama.cpp
# First: build llama.cpp
# git clone https://github.com/ggerganov/llama.cpp
# cd llama.cpp && make -j4
# 1. Convert HuggingFace model -> GGUF FP16
#    (convert_hf_to_gguf.py expects a local checkpoint directory;
#     download it first, e.g. with huggingface-cli)
python convert_hf_to_gguf.py \
    ./Llama-3.1-8B-Instruct \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16
# 2. Quantize to different formats
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q8_0.gguf Q8_0
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q5_k_m.gguf Q5_K_M
# 3. Benchmark with llama.cpp
./llama-bench \
    -m llama-3.1-8b-q4_k_m.gguf \
    -p 512 \
    -n 128 \
    -t 8
# -p: prompt tokens, -n: tokens to generate, -t: CPU threads
# Typical output on M2 Pro (16 GB):
# Q4_K_M: prompt 45.2 t/s, generate 28.1 t/s
# Q8_0: prompt 24.1 t/s, generate 15.8 t/s
# === PYTHON USAGE via llama-cpp-python ===
# pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(
model_path="./llama-3.1-8b-q4_k_m.gguf",
n_ctx=4096, # Context window
n_threads=8, # CPU threads
n_gpu_layers=35, # Offload 35 layers to GPU (0 = CPU only)
verbose=False
)
response = llm(
"Q: How does model quantization work? A:",
max_tokens=256,
stop=["Q:", "\n\n"],
echo=True
)
print(response["choices"][0]["text"])
# === GGUF WITH OLLAMA (simpler approach) ===
modelfile_content = """
FROM ./llama-3.1-8b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a technical assistant specializing in deep learning."
"""
with open("Modelfile", "w") as f:
f.write(modelfile_content)
# ollama create my-llama -f Modelfile
# ollama run my-llama
Accuracy vs Speed vs Memory Benchmarks
Standard metrics for evaluating quantized model quality include perplexity on Wikitext-2, reasoning benchmarks (HellaSwag, MMLU), and domain-specific tasks.
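Perplexity itself is just the exponential of the mean per-token negative log-likelihood, so comparing quantized variants needs nothing beyond the model's logits. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: [n_tokens, vocab], labels: [n_tokens]. PPL = exp(mean NLL)."""
    nll = F.cross_entropy(logits, labels)
    return torch.exp(nll).item()

# Sanity check: uniform logits over a vocab of 100 -> perplexity ~100
logits = torch.zeros(64, 100)
labels = torch.randint(0, 100, (64,))
print(perplexity(logits, labels))  # ~100.0
```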
# Automated evaluation with lm-evaluation-harness
# pip install lm-eval
# BF16 baseline evaluation
lm_eval --model hf \
--model_args "pretrained=meta-llama/Llama-3.1-8B-Instruct" \
--tasks hellaswag,mmlu \
--batch_size 4 \
--output_path results_bf16/
# GPTQ INT4 evaluation
lm_eval --model hf \
  --model_args "pretrained=./llama-3.1-8b-gptq-int4" \
--tasks hellaswag,mmlu \
--batch_size 4 \
--output_path results_gptq_int4/
# === MEMORY AND SPEED BENCHMARK SCRIPT ===
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
def benchmark_model(model_name_or_path, quant_config=None, n_tokens=100, n_runs=5):
"""Complete benchmark: memory, latency, throughput."""
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
quantization_config=quant_config,
device_map="cuda",
torch_dtype=torch.bfloat16 if quant_config is None else None
)
mem_gb = model.get_memory_footprint() / 1024**3
prompt = "Explain the transformer architecture in detail:" * 3
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Warm-up
with torch.no_grad():
model.generate(**inputs, max_new_tokens=10)
# Benchmark
torch.cuda.synchronize()
latencies = []
for _ in range(n_runs):
start = time.perf_counter()
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=n_tokens)
torch.cuda.synchronize()
latencies.append(time.perf_counter() - start)
avg_latency = sum(latencies) / len(latencies)
throughput = n_tokens / avg_latency
return {
"memory_gb": round(mem_gb, 2),
"latency_sec": round(avg_latency, 3),
"throughput_tps": round(throughput, 1)
}
# Compare BF16 vs INT8 vs INT4
results = {}
results["BF16"] = benchmark_model("meta-llama/Llama-3.1-8B-Instruct")
config_8bit = BitsAndBytesConfig(load_in_8bit=True)
results["INT8"] = benchmark_model("meta-llama/Llama-3.1-8B-Instruct", config_8bit)
config_4bit = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True
)
results["INT4-NF4"] = benchmark_model("meta-llama/Llama-3.1-8B-Instruct", config_4bit)
for name, r in results.items():
print(f"{name:10} | Mem: {r['memory_gb']:5.2f} GB | "
f"Latency: {r['latency_sec']:.3f}s | "
f"Throughput: {r['throughput_tps']:.1f} t/s")
# Typical results for Llama-3.1-8B on RTX 3090:
# BF16 | Mem: 16.02 GB | Latency: 3.821s | Throughput: 26.2 t/s
# INT8 | Mem: 8.51 GB | Latency: 4.103s | Throughput: 24.4 t/s
# INT4-NF4 | Mem: 4.89 GB | Latency: 3.412s | Throughput: 29.3 t/s
Benchmark Summary (Llama-3.1-8B on RTX 4090)
| Method | Memory | Throughput | HellaSwag | Perplexity |
|---|---|---|---|---|
| BF16 (baseline) | 16.0 GB | 38 t/s | 82.1% | 6.14 |
| INT8 (bitsandbytes) | 8.5 GB | 35 t/s | 81.8% | 6.21 |
| INT4 NF4 (bnb) | 4.9 GB | 42 t/s | 81.2% | 6.47 |
| GPTQ INT4 | 4.8 GB | 55 t/s | 81.5% | 6.39 |
| AWQ INT4 | 4.7 GB | 52 t/s | 81.6% | 6.35 |
| Q4_K_M (GGUF, CPU) | 4.9 GB | 18 t/s | 81.3% | 6.42 |
Note: indicative values, vary with hardware, specific model, and batch size.
Quantization-Aware Training with PyTorch
For scenarios where PTQ is insufficient — typically models under 3B parameters or quantization below 4 bits — QAT allows recovering significant accuracy. PyTorch includes native QAT support from version 2.0, with static and dynamic INT8 support.
import torch
import torch.nn as nn
from torch.ao.quantization import (
prepare_qat_fx, convert_fx,
get_default_qat_qconfig_mapping
)
# === MODEL DEFINITION ===
class SimpleTransformerBlock(nn.Module):
def __init__(self, d_model=256, nhead=4, ff_dim=1024):
super().__init__()
self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
self.norm1 = nn.LayerNorm(d_model)
self.ff = nn.Sequential(
nn.Linear(d_model, ff_dim),
nn.ReLU(),
nn.Linear(ff_dim, d_model)
)
self.norm2 = nn.LayerNorm(d_model)
def forward(self, x):
attn_out, _ = self.attn(x, x, x)
x = self.norm1(x + attn_out)
return self.norm2(x + self.ff(x))
class SimpleModel(nn.Module):
def __init__(self, vocab_size=1000, d_model=256, n_layers=4):
super().__init__()
self.embed = nn.Embedding(vocab_size, d_model)
self.blocks = nn.Sequential(*[
SimpleTransformerBlock(d_model) for _ in range(n_layers)
])
self.head = nn.Linear(d_model, vocab_size)
def forward(self, x):
return self.head(self.blocks(self.embed(x)))
# === QAT SETUP ===
model = SimpleModel()
model.train()
# QConfig specifies how to quantize activations and weights
qconfig_mapping = get_default_qat_qconfig_mapping("x86")
# Example inputs for FX tracing (must be passed as a tuple)
example_inputs = (torch.randint(0, 1000, (2, 32)),)
# Prepare for QAT: inserts FakeQuantize nodes that simulate
# quantization during the forward pass
model_prepared = prepare_qat_fx(model, qconfig_mapping, example_inputs)
# === QAT TRAINING LOOP ===
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
def train_qat(model, n_epochs=10, freeze_quantizer_epoch=8):
for epoch in range(n_epochs):
# Synthetic data (replace with real dataset)
x = torch.randint(0, 1000, (32, 64))
y = torch.randint(0, 1000, (32, 64))
optimizer.zero_grad()
out = model(x)
loss = criterion(out.view(-1, 1000), y.view(-1))
loss.backward()
optimizer.step()
# Freeze quantizer after N epochs: stabilizes scale factors
if epoch == freeze_quantizer_epoch:
model.apply(torch.quantization.disable_observer)
print(f"Epoch {epoch}: FakeQuantize observers frozen")
if epoch % 2 == 0:
print(f"Epoch {epoch}/{n_epochs}, Loss: {loss.item():.4f}")
train_qat(model_prepared)
# === CONVERT TO REAL INT8 ===
model_prepared.eval()
model_int8 = convert_fx(model_prepared)
torch.save(model_int8.state_dict(), "model_qat_int8.pt")
print("QAT model saved as INT8 successfully")
Best Practices and Anti-Patterns
Quantization Best Practices
- Match method to deployment target: GPTQ INT4 for GPU servers; GGUF Q4_K_M for CPU/edge; NF4 with bitsandbytes for fine-tuning workflows.
- Use representative calibration data: for GPTQ and AWQ, use domain-specific samples, not generic text. 128-512 samples are enough but must be meaningful.
- Always benchmark on domain-specific tasks: Wikitext perplexity is a proxy metric that may miss quality degradation on code, math, or non-English languages.
- Optimal group size: group_size=128 is the safe default; 64 improves quality at higher memory cost; 256 saves memory but hurts quality.
- Enable double quantization: always set bnb_4bit_use_double_quant=True; it saves ~0.4 bits/param with minimal quality impact.
- Use BF16 as compute dtype: bnb_4bit_compute_dtype=torch.bfloat16 is more numerically stable than FP16 and supported on Ampere+ (RTX 3000+, A100).
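The group_size trade-off can be measured directly: a smaller group means each absmax scale covers fewer weights, so one outlier coarsens the grid for fewer neighbors. A quick sketch on random weights (illustrative helper, round-to-nearest only):

```python
import torch

def groupwise_int4_error(w: torch.Tensor, group_size: int) -> float:
    """Mean absolute error of INT4 round-to-nearest, one absmax scale per group."""
    g = w.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True) / 7
    deq = torch.round(g / scale).clamp(-8, 7) * scale
    return (g - deq).abs().mean().item()

torch.manual_seed(0)
w = torch.randn(1024, 1024)
for gs in (64, 128, 256):
    print(f"group_size={gs}: MAE {groupwise_int4_error(w, gs):.5f}")
# MAE grows with group size: a larger group's absmax sets a coarser step for all its weights
```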
Anti-Patterns to Avoid
- Never quantize critical layers uniformly: first and last layers (embedding, LM head) are more sensitive. GPTQ and AWQ exclude them automatically; replicate this behavior in custom implementations.
- Avoid FP4 for LLM inference: NF4 is superior to FP4 for LLM weights because they follow a normal distribution. NF4 is information-theoretically optimal for normally distributed values.
- Never compare methods without equal benchmarks: INT4 with GPTQ and INT4 with bitsandbytes have different effective quality. Always run the same evaluation suite before conclusions.
- Watch for dequantization overhead: bitsandbytes INT4 must dequantize during each forward pass. On memory-bandwidth-limited GPUs, INT8 may outperform INT4.
- Do not use Q2_K in production: quality degrades too much for most use cases. Q3_K_M is the minimum acceptable level for simple tasks.
SmoothQuant: Solving the Activation Outlier Problem
A fundamental challenge in LLM quantization is the presence of activation outliers: a small number of channels (typically 0.1-1% of hidden dimensions) have magnitudes 100-1000x larger than the rest. Quantizing these with the same scale as normal activations causes catastrophic precision loss for the majority of values. SmoothQuant (Xiao et al., 2022) elegantly solves this by migrating quantization difficulty from activations to weights, which are inherently easier to quantize.
The core idea: if activations have outlier channels, multiply them by a smoothing factor
s (reducing their magnitude) and divide the corresponding weight channels by the
same s (compensating mathematically). After smoothing, both activations and weights
fall within INT8 range. This is a mathematically equivalent transformation that enables true
W8A8 (weights INT8, activations INT8) quantization on hardware.
import torch
import torch.nn as nn
# ===================================================================
# SMOOTHQUANT: MIGRATING QUANTIZATION DIFFICULTY
# ===================================================================
# Key insight: Y = (X * diag(s)^-1) @ (diag(s) * W^T)
# = X_smooth @ W_smooth^T
# Where s is a per-channel smoothing factor that balances difficulty.
# After this transform, both X_smooth and W_smooth are easy to quantize.
def compute_smoothing_factors(
activation_stats: torch.Tensor, # Per-channel max: [hidden_dim]
weight: torch.Tensor, # [out_features, in_features]
    alpha: float = 0.5                  # Migration strength (higher alpha shifts more difficulty onto weights)
) -> torch.Tensor:
"""
Compute per-channel smoothing factors s.
alpha controls the migration balance:
- alpha=0.5: balanced (default, works for most models)
- alpha closer to 1.0: more migration to weights (helps when activations are very spiky)
- alpha closer to 0.0: more migration to activations (helps when weights are spiky)
Formula: s_j = max(|X_j|)^alpha / max(|W_j|)^(1-alpha)
"""
# Per-channel stats for weights: max absolute value per input feature
weight_max = weight.abs().max(dim=0).values # [in_features]
act_max = activation_stats.clamp(min=1e-8) # [in_features] - avoid division by zero
# Smoothing factor balances the quantization ranges
smooth_factor = (act_max ** alpha) / (weight_max ** (1 - alpha))
smooth_factor = smooth_factor.clamp(min=1e-8) # Numerical stability
return smooth_factor
class SmoothQuantLinear(nn.Module):
"""
Linear layer with SmoothQuant applied.
In practice, SmoothQuant is applied offline (pre-deployment) to
the pre-trained model weights. Here we show the conceptual approach.
"""
def __init__(
self,
in_features: int,
out_features: int,
smooth_factor: torch.Tensor = None
):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
if smooth_factor is not None:
self.register_buffer('smooth_factor', smooth_factor)
            # Apply smoothing to weights offline: W_smooth = W * diag(s)
            with torch.no_grad():
                self.linear.weight.data = (
                    self.linear.weight.data * smooth_factor.unsqueeze(0)
                )
else:
self.register_buffer('smooth_factor', torch.ones(in_features))
def forward(self, x: torch.Tensor) -> torch.Tensor:
# Apply inverse smoothing to activations: X_smooth = X / diag(s)
x_smooth = x / self.smooth_factor
# The weight is already smoothed offline
return self.linear(x_smooth)
# ===================================================================
# PRACTICAL SMOOTHQUANT APPLICATION
# ===================================================================
def apply_smoothquant(
model: nn.Module,
calibration_activations: dict, # {layer_name: activation_max_per_channel}
alpha: float = 0.5
) -> nn.Module:
"""
Apply SmoothQuant to all Linear layers in a model.
calibration_activations: collected during calibration forward pass.
"""
for name, module in model.named_modules():
if not isinstance(module, nn.Linear):
continue
if name not in calibration_activations:
continue
act_max = calibration_activations[name] # [in_features]
smooth_factor = compute_smoothing_factors(
act_max,
module.weight,
alpha=alpha
)
        # Apply smoothing to the weight (offline): W_smooth = W * diag(s)
        with torch.no_grad():
            module.weight.data = module.weight.data * smooth_factor.unsqueeze(0)
        # Register the factor; at inference the matching X / s division is usually
        # absorbed into the preceding LayerNorm's weight, making it zero-overhead
        module.register_buffer("smooth_factor", smooth_factor)
return model
# ===================================================================
# DEMONSTRATION: OUTLIER ANALYSIS
# ===================================================================
def analyze_activation_outliers(activations: torch.Tensor, threshold: float = 6.0):
"""
Analyze activation outliers (the problem SmoothQuant solves).
LLM.int8() uses threshold=6.0 as default.
"""
    flat = activations.reshape(-1, activations.shape[-1]).abs()  # [samples, D]
    D = flat.shape[-1]
    outlier_channels = (flat.max(dim=0).values > threshold).sum().item()
total_channels = D
outlier_ratio = outlier_channels / total_channels
print(f"Activation analysis:")
print(f" Max activation: {flat.max().item():.2f}")
print(f" Mean activation: {flat.mean().item():.4f}")
print(f" Outlier channels (>{threshold}): {outlier_channels}/{total_channels} ({outlier_ratio:.1%})")
print(f" Quantization range without smoothing: [{-flat.max().item():.2f}, {flat.max().item():.2f}]")
print(f" -> INT8 step size: {flat.max().item()*2/255:.4f} (very coarse for normal values!)")
return outlier_channels, outlier_ratio
# Simulate LLM activations (Gaussian + heavy-tail outliers in some channels)
torch.manual_seed(42)
d_model = 4096
n_samples = 128
activations = torch.randn(n_samples, d_model)
# Add outliers to ~0.5% of channels
outlier_channels_idx = torch.randperm(d_model)[:int(d_model * 0.005)]
activations[:, outlier_channels_idx] *= 150 # 150x larger than normal
analyze_activation_outliers(activations, threshold=6.0)
# Output:
# Max activation: ~212.5
# Mean activation: 0.4821
# Outlier channels (>6.0): ~20/4096 (0.5%)
# INT8 step size: 1.667 (horrible precision for values < 5!)
# After SmoothQuant: max ~5.2, INT8 step ~0.041 (40x better!)
Modern Ultra-Low Precision: INT2 and Extreme Quantization
The frontier of quantization has pushed beyond INT4 toward extreme compression. In 2024-2025, several methods achieved INT2-INT3 quantization with acceptable quality for many use cases. QuIP# (Tseng et al., 2024) uses randomized Hadamard transforms to incoherize weight matrices before quantization, enabling true 2-bit quantization. AQLM (Additive Quantization of Language Models) uses vector quantization codebooks. These methods target a new edge deployment tier: 70B models in under 20 GB RAM.
import torch
import torch.nn as nn
import math
# ===================================================================
# HADAMARD RANDOMIZED QUANTIZATION (QuIP# principle)
# ===================================================================
# Core idea: applying a random orthogonal transformation to weights
# makes the quantization error more uniform (incoherent).
# Incoherence = no single dimension dominates the error.
# This allows 2-bit quantization with much less quality loss.
def hadamard_transform(x: torch.Tensor, normalize: bool = True) -> torch.Tensor:
"""
Applies the Walsh-Hadamard transform to the last dimension.
H_n is a 2^n x 2^n matrix with entries +1/-1, orthogonal.
Transform: y = H @ x / sqrt(n)
"""
d = x.shape[-1]
# d must be a power of 2
assert d > 0 and (d & (d - 1)) == 0, "Dimension must be power of 2"
x = x.clone()
n = d
h = 1
while h < n:
for i in range(0, n, h * 2):
for j in range(i, i + h):
a = x[..., j].clone()
b = x[..., j + h].clone()
x[..., j] = a + b
x[..., j + h] = a - b
h *= 2
if normalize:
x = x / math.sqrt(d)
return x
```python
class QuIPQuantizer:
    """
    QuIP#-style quantizer: randomized incoherence preprocessing + INT2.
    Simplified version for educational purposes.
    Production QuIP# uses optimized CUDA kernels for the Hadamard transform.
    """

    def __init__(self, n_bits: int = 2, group_size: int = 256):
        self.n_bits = n_bits
        self.group_size = group_size
        self.n_levels = 2 ** n_bits  # 4 levels for INT2, 16 for INT4

    def quantize_weight(self, W: torch.Tensor, seed: int = 42) -> tuple:
        """
        QuIP#-style quantization:
        1. Apply a random orthogonal transform (Hadamard)
        2. Quantize in the transformed space
        3. Store the transform seed (not the full matrix)
        """
        out_features, in_features = W.shape
        # Pad to a power of 2 if needed (the butterfly requires it)
        pad_to = 2 ** math.ceil(math.log2(in_features))
        if pad_to > in_features:
            W_padded = torch.zeros(out_features, pad_to, dtype=W.dtype, device=W.device)
            W_padded[:, :in_features] = W
        else:
            W_padded = W
        # Randomize column order (simulates a random rotation)
        torch.manual_seed(seed)
        perm = torch.randperm(pad_to)
        W_perm = W_padded[:, perm]
        # Apply the Hadamard transform per row
        W_transformed = hadamard_transform(W_perm, normalize=True)
        # Quantize ALL pad_to transformed columns: the transform mixes
        # every dimension, so padding can only be dropped after the
        # inverse transform at dequantization time
        W_q, scale, _ = self._uniform_quantize(W_transformed)
        return W_q, scale, perm, seed

    def _uniform_quantize(self, x: torch.Tensor) -> tuple:
        """Group-wise symmetric uniform quantization."""
        B, D = x.shape
        n_groups = max(1, D // self.group_size)
        x_grouped = x[:, :n_groups * self.group_size].reshape(B * n_groups, self.group_size)
        scale = x_grouped.abs().max(dim=-1, keepdim=True).values / ((self.n_levels // 2) - 1)
        scale = scale.clamp(min=1e-8)
        x_q = torch.round(x_grouped / scale).clamp(-(self.n_levels // 2), self.n_levels // 2 - 1)
        x_q = x_q.reshape(B, n_groups * self.group_size)
        scale = scale.reshape(B, n_groups)
        return x_q.to(torch.int8), scale, n_groups
```
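To get intuition for what group-wise symmetric quantization costs at different bit-widths, here is a standalone round-trip on hypothetical random weights (the function re-implements the group-wise scheme above so the snippet runs on its own):

```python
import torch

def quantize_groupwise(x: torch.Tensor, n_bits: int = 2,
                       group_size: int = 256) -> torch.Tensor:
    """Quantize and immediately dequantize, group-wise and symmetric."""
    n_levels = 2 ** n_bits
    B, D = x.shape
    g = x.reshape(B * (D // group_size), group_size)
    scale = g.abs().max(dim=-1, keepdim=True).values / (n_levels // 2 - 1)
    scale = scale.clamp(min=1e-8)
    q = torch.round(g / scale).clamp(-(n_levels // 2), n_levels // 2 - 1)
    return (q * scale).reshape(B, D)

torch.manual_seed(0)
W = torch.randn(64, 512)
for bits in (2, 3, 4, 8):
    rel_err = (W - quantize_groupwise(W, n_bits=bits)).norm() / W.norm()
    print(f"{bits}-bit group-wise: relative error {rel_err:.3f}")
```

The error drops steeply from 2 to 4 bits, which is why naive scalar INT2 is rarely usable without the incoherence tricks described above.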
```python
# ===================================================================
# GGUF ULTRA-LOW PRECISION BENCHMARKS
# ===================================================================
gguf_ultra_low = [
    {"format": "Q2_K",   "bits": 2.6, "llama8b_gb": 2.9, "ppl_loss": "+4.2", "usable": False},
    {"format": "Q3_K_S", "bits": 3.0, "llama8b_gb": 3.4, "ppl_loss": "+2.1", "usable": True},
    {"format": "Q3_K_M", "bits": 3.9, "llama8b_gb": 3.9, "ppl_loss": "+1.4", "usable": True},
    {"format": "Q4_K_M", "bits": 4.8, "llama8b_gb": 4.9, "ppl_loss": "+0.6", "usable": True},
    {"format": "Q5_K_M", "bits": 5.7, "llama8b_gb": 5.7, "ppl_loss": "+0.3", "usable": True},
    {"format": "Q6_K",   "bits": 6.6, "llama8b_gb": 6.6, "ppl_loss": "+0.1", "usable": True},
    {"format": "Q8_0",   "bits": 8.0, "llama8b_gb": 8.1, "ppl_loss": "~0.0", "usable": True},
]

print(f"\n{'Format':<12} {'Avg bits':>10} {'Size (8B)':>12} {'PPL loss':>10} {'Usable?':>10}")
print("-" * 60)
for fmt in gguf_ultra_low:
    print(f"{fmt['format']:<12} {fmt['bits']:>10.1f} {str(fmt['llama8b_gb']) + 'GB':>12} "
          f"{fmt['ppl_loss']:>10} {str(fmt['usable']):>10}")

print("\nRecommendation:")
print("  Minimum for production: Q3_K_M (3.9 bits, +1.4 perplexity)")
print("  Sweet spot: Q4_K_M (4.8 bits, +0.6 perplexity)")
print("  High quality: Q5_K_M or Q6_K")
print("  Q2_K is only for extreme RAM constraints (testing only)")
```
```python
# ===================================================================
# AQLM: ADDITIVE QUANTIZATION (2-bit via codebook)
# ===================================================================
# AQLM represents weight sub-vectors using a learned codebook.
# Instead of scalar quantization (each weight -> int), it uses
# vector quantization: a group of d weights -> one codebook index.
# This achieves effective 2-bit with much better quality than
# scalar INT2, because the codebook captures weight correlations.
#
# AQLM codebook structure:
#   - C codebooks, each with K entries of size d
#   - A weight vector w of size d is encoded as: w ≈ sum_c(C_c[I_c])
#   - For d=8, K=256, C=2: effective 2 bits/param (2 * 8 bits / 8 weights = 2 bpw)
#
# Loading an AQLM-quantized model (example):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       "ISTA-DASLab/Llama-3-8B-AQLM-2Bit-1x16",
#       torch_dtype=torch.float16, device_map="cuda"
#   )
# Memory: Llama-3-8B at 2-bit = ~2.5GB vs 16GB in BF16

print("\nAQLM 2-bit: Llama-3-8B at ~2.5GB | Quality: similar to scalar INT4")
```
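The bits-per-parameter arithmetic can be made concrete with a toy vector quantizer. This is a hypothetical sketch with a random, untrained codebook, not AQLM itself (which learns its codebooks end-to-end and sums C of them additively); it only illustrates the encoding cost of one codebook:

```python
import math
import torch

# One codebook: each group of d=8 weights is replaced by the nearest
# of K=256 entries, costing log2(256) = 8 bits per group, i.e.
# 8 bits / 8 weights = 1 bit/param. Two codebooks -> the 2 bpw above.
torch.manual_seed(0)
d, K = 8, 256
codebook = torch.randn(K, d) * 0.02     # stand-in for a learned codebook
W = torch.randn(4096, d) * 0.02         # 4096 groups of 8 weights

idx = torch.cdist(W, codebook).argmin(dim=-1)   # nearest codeword per group
W_hat = codebook[idx]

bpw = math.log2(K) / d
rel_err = (W - W_hat).norm() / W.norm()
print(f"bits/param (one codebook): {bpw:.0f}")
print(f"relative error with a random codebook: {rel_err:.2f}")
```

With a trained codebook the error drops far below this random baseline, which is exactly what AQLM's end-to-end optimization buys.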
## Deployment Scenario Guide
| Scenario | Hardware | Recommended method | Format |
|---|---|---|---|
| GPU production server | A100/H100 80GB | GPTQ INT4 or AWQ INT4 | Safetensors GPTQ |
| Consumer workstation | RTX 4090 24GB | GPTQ INT4 (up to ~30B models) | Safetensors GPTQ |
| Windows/Linux laptop | 8-16GB GPU VRAM | bitsandbytes NF4 or AWQ | HuggingFace Hub |
| Apple M-series laptop | 16-96GB Unified Mem | GGUF Q4_K_M or Q5_K_M | GGUF + llama.cpp/Ollama |
| Raspberry Pi 5 | 8GB RAM | GGUF Q3_K_M or Q4_K_M (1-3B) | GGUF + llama.cpp |
| NVIDIA Jetson Orin | 16GB unified mem | GPTQ INT4 or GGUF Q4_K_M | GPTQ or GGUF |
| Fine-tuning on limited GPU | RTX 3090 24GB | QLoRA (NF4 + LoRA) | bitsandbytes NF4 |
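The table above can be collapsed into a small lookup helper. The platform tags and VRAM threshold here are illustrative choices, not official guidance:

```python
def recommend_quantization(platform: str, mem_gb: float) -> str:
    """Rough encoding of the deployment scenario table (illustrative)."""
    if platform in ("apple", "cpu", "raspberry-pi"):
        return "GGUF Q4_K_M or Q5_K_M via llama.cpp / Ollama"
    if platform == "finetune":
        return "QLoRA: bitsandbytes NF4 + LoRA adapters"
    # CUDA GPUs: choose by available VRAM
    if mem_gb >= 24:
        return "GPTQ INT4 or AWQ INT4"
    return "bitsandbytes NF4 or AWQ"

print(recommend_quantization("apple", 32))  # GGUF Q4_K_M or Q5_K_M via llama.cpp / Ollama
print(recommend_quantization("cuda", 12))   # bitsandbytes NF4 or AWQ
```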
## Conclusion
Model quantization has evolved from a memory-saving niche technique to a foundational tool for anyone working with LLMs. The 2026 landscape is rich: bitsandbytes for rapid prototyping and QLoRA fine-tuning, GPTQ for optimized GPU deployment, AWQ for heterogeneous and cross-platform hardware, and GGUF for CPU and edge devices.
The key insight is matching the method to the context. INT4 with GPTQ on an RTX 4090 is often faster than BF16 thanks to optimized kernels. GGUF Q4_K_M on an M3 MacBook Pro runs Llama-3.1-8B at 28 tokens/sec without a dedicated GPU. These are not compromises — they are new deployment paradigms that enable use cases previously impossible.
The natural next step is combining quantization with knowledge distillation, covered in our next article: how to transfer knowledge from a large quantized model to a smaller one, getting the best of both compression techniques.
## Next Steps
- Next article: Knowledge Distillation: Compress Models Efficiently
- Related: Fine-Tuning with LoRA and QLoRA
- See also: Running Local LLMs with Ollama
- MLOps series: Model Serving and Deployment