Introduction: Reusing Model Knowledge
Transfer learning is the key to leveraging models pre-trained on massive datasets without needing the computational resources to train from scratch. Models like BERT and GPT were pre-trained on corpora ranging from billions to hundreds of billions of tokens, capturing deep language understanding that can be transferred to specific tasks with limited data and compute.
In this article we will explore modern fine-tuning strategies, prompt engineering, Retrieval-Augmented Generation (RAG), and the Hugging Face ecosystem, with comparisons between open-source models like Llama, Mistral, and Falcon.
What You Will Learn
- BERT: bidirectional pre-training and fine-tuning for understanding tasks
- GPT: auto-regressive generation and in-context learning
- Fine-tuning strategies: full, LoRA, adapters, and QLoRA
- Prompt engineering: techniques for better outputs
- RAG: combining LLMs with search for accurate answers
- Open-source models: Llama, Mistral, Falcon - when to use which
- Hugging Face Hub: the pre-trained model ecosystem
BERT: Bidirectional Text Understanding
BERT (Bidirectional Encoder Representations from Transformers) revolutionized NLP by demonstrating that bidirectional pre-training produces extraordinarily rich language representations. During pre-training, BERT uses two objectives:
- Masked Language Modeling (MLM): 15% of tokens are masked and the model must predict them from bidirectional context
- Next Sentence Prediction (NSP): the model predicts whether two sentences are consecutive in the original text
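The MLM objective above can be sketched in a few lines. This is a toy illustration of how training examples are constructed (the function name and tiny vocabulary are invented for the example); it follows BERT's published 80/10/10 rule: of the selected tokens, 80% become [MASK], 10% become a random token, and 10% are left unchanged.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Toy sketch of BERT's MLM masking with the 80/10/10 rule."""
    rng = random.Random(seed)
    vocab = ["the", "movie", "was", "great", "awful", "plot"]  # toy vocab
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)       # the model must predict the original
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")      # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)           # 10%: keep unchanged
        else:
            targets.append(None)      # no loss on unselected positions
            inputs.append(tok)
    return inputs, targets

inp, tgt = make_mlm_example("the plot of the movie was great".split())
print(inp)
print(tgt)
```

The loss is computed only on positions where the target is not `None`, which is why the bidirectional context around each mask is so informative.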
For fine-tuning, simply add a classification layer on top of BERT's pooled [CLS] output and train the entire model on a few thousand labeled examples:
```python
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch

# Load BERT for sentiment classification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # positive/negative
)

# Tokenize data
texts = ["This movie is great!", "Terrible waste of time."]
labels = [1, 0]  # 1=positive, 0=negative
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")
inputs['labels'] = torch.tensor(labels)

# Forward pass
outputs = model(**inputs)
print(f"Loss: {outputs.loss:.4f}")
print(f"Logits: {outputs.logits}")

# Fine-tuning with the Trainer API
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a tokenized datasets.Dataset
    eval_dataset=eval_dataset,
)
trainer.train()
```
Efficient Fine-Tuning: LoRA and QLoRA
Full fine-tuning of models with billions of parameters requires enormous resources. Parameter-Efficient Fine-Tuning (PEFT) allows adapting the model by modifying only a small fraction of parameters:
LoRA (Low-Rank Adaptation)
LoRA freezes the original model weights and adds trainable low-rank matrices alongside attention layers. It typically modifies less than 1% of total parameters, achieving performance comparable to full fine-tuning.
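The mechanics are simple enough to show directly: the frozen weight W is augmented with a low-rank update, so the layer computes Wx + (α/r)·BAx, where B is initialized to zero so training starts exactly at the pre-trained model. Below is a minimal numpy sketch (the dimensions match a Llama-style 4096-wide projection; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 16                          # hidden size, LoRA rank
W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, shape (r, d)
B = np.zeros((d, r))                     # trainable, initialized to zero
alpha = 32                               # scaling factor

x = rng.standard_normal(d)
# LoRA forward: frozen path plus the scaled low-rank update B @ A
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapted layer starts exactly at the pre-trained one
assert np.allclose(y, W @ x)

# Parameter comparison: full layer vs. LoRA update
print(f"Full: {d*d:,} params, LoRA: {2*d*r:,} params "
      f"({100*2*d*r/(d*d):.2f}%)")
# Full: 16,777,216 params, LoRA: 131,072 params (0.78%)
```

After training, B @ A can be merged back into W, so inference adds no latency.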
QLoRA
QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of 65B parameter models on a single GPU with 48GB VRAM. It uses the NF4 (NormalFloat 4-bit) data type and double quantization for maximum efficiency.
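To build intuition for the 4-bit storage, here is a toy absmax quantizer in numpy. This is only an illustration of the idea: it uses 16 uniformly spaced levels, whereas real NF4 places its 16 levels at quantiles of a normal distribution (and QLoRA additionally quantizes the per-block scales themselves, the "double quantization").

```python
import numpy as np

LEVELS = np.linspace(-1, 1, 16)           # 16 representable values (4 bits)

def quantize_4bit(w):
    """Toy absmax 4-bit quantization of a block of weights."""
    scale = np.abs(w).max()               # one fp scale per block
    idx = np.abs(w[:, None] / scale - LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale    # 4-bit codes + the scale

def dequantize_4bit(idx, scale):
    return LEVELS[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024) * 0.02      # a block of weights
codes, scale = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scale)
print(f"Max abs error: {np.abs(w - w_hat).max():.5f}")
# Storage: 4 bits per weight + one scale per block, vs. 16/32 bits per weight
```

During QLoRA fine-tuning the frozen base weights stay in this compressed form and are dequantized on the fly, while only the LoRA matrices are trained in full precision.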
```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # Rank of the LoRA matrices
    lora_alpha=32,     # Scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none"
)

# Apply LoRA to the model (base weights stay frozen)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
peft_model = get_peft_model(model, lora_config)

# Count trainable parameters
trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f"Trainable: {trainable:,} / {total:,} "
      f"({100*trainable/total:.2f}%)")
# Output: well under 1% of the parameters are trainable
```
Prompt Engineering: The Art of Communicating with LLMs
Prompt engineering is the practice of formulating instructions that guide the model toward the desired output without modifying its weights. Key techniques include:
- Few-shot learning: providing worked examples directly in the prompt
- Chain-of-thought: asking the model to reason step by step before answering
- Role prompting: assigning the model a specific role or persona
- Structured output: requesting a specific format such as JSON
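These techniques compose naturally in a single prompt. The sketch below assembles one as a plain string (the review text and labels are invented for the example; sending the prompt to an actual model is left out):

```python
# Combining role prompting, few-shot examples, chain-of-thought,
# and structured output in one prompt string.
examples = [
    ("I loved every minute of it.", "positive"),
    ("The plot made no sense.", "negative"),
]

# Few-shot: show the model worked examples
few_shot = "\n\n".join(f"Review: {t}\nSentiment: {s}" for t, s in examples)

prompt = (
    "You are a careful sentiment classifier.\n\n"   # role prompting
    f"{few_shot}\n\n"
    "Review: The acting was superb but the ending fell flat.\n"
    # Chain-of-thought plus structured output:
    'Think step by step, then answer with JSON: {"sentiment": "..."}'
)
print(prompt)
```

The same template idea underlies most prompt libraries: the prompt is just a string, and the engineering lies in what you put into it.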
RAG: Retrieval-Augmented Generation
RAG combines the generative capability of LLMs with a search system to provide accurate answers based on specific documents. Instead of relying solely on knowledge memorized during pre-training, the model receives relevant context retrieved from a document database.
The RAG process consists of three phases:
- Indexing: documents are split into chunks and transformed into vector embeddings
- Retrieval: given a query, the most similar chunks are retrieved via similarity search
- Generation: retrieved chunks are inserted into the prompt as context for the LLM
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Document splitting
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = text_splitter.split_text(document_text)  # document_text: your source text

# 2. Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector_store = FAISS.from_texts(chunks, embeddings)

# 3. Retrieval and generation
query = "How does transfer learning work?"
relevant_docs = vector_store.similarity_search(query, k=3)

# Build the prompt with the retrieved context
context = "\n".join(doc.page_content for doc in relevant_docs)
prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
```
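The retrieval phase can also be seen in miniature without any dependencies. Below, a toy bag-of-words counter stands in for the sentence-transformer embedding, and cosine similarity does the ranking, exactly the role FAISS plays above (chunk texts and function names are invented for the example):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a sentence-transformer."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Transfer learning reuses a pre-trained model on a new task.",
    "FAISS performs fast similarity search over dense vectors.",
    "Bananas are rich in potassium.",
]
query = "How does transfer learning work?"

# Rank chunks by similarity to the query, most relevant first
ranked = sorted(chunks, key=lambda c: cosine(embed(query), embed(c)),
                reverse=True)
print(ranked[0])
# The top-ranked chunk is what gets inserted into the LLM prompt as context
```

Real embeddings capture meaning rather than word overlap, but the pipeline shape (embed, rank, insert into the prompt) is identical.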
Open-Source Models: Llama, Mistral, Falcon
The open-source model ecosystem has exploded, offering competitive alternatives to proprietary models:
- Llama (Meta): model family from 7B to 70B parameters, excellent for fine-tuning and on-premise deployment. Llama 3 achieves competitive performance with GPT-3.5
- Mistral: efficient models with innovative architecture (Sliding Window Attention, Mixture of Experts). Mistral 7B outperforms Llama 2 13B on many benchmarks
- Falcon: trained on high-quality dataset (RefinedWeb), offers good zero-shot performance
The choice depends on the use case: for general text generation, Llama 3 is often the best choice; for efficiency on limited resources, Mistral 7B is ideal; for specific tasks, fine-tuning with LoRA on any of these models produces excellent results.
Hugging Face: The Complete Ecosystem
Hugging Face has become the reference point for deep learning NLP, offering a complete ecosystem:
- Model Hub: over 500,000 pre-trained models, downloadable with a single line of code
- Transformers Library: unified APIs for all models (BERT, GPT, T5, Llama, etc.)
- Datasets: thousands of datasets for training and evaluation
- Trainer API: optimized training loop with distributed training, mixed precision, gradient accumulation
- Spaces: free hosting for ML demos and apps
Next Steps in the Series
- In the next article we will explore TinyML and Edge AI
- We will see how to deploy deep learning models on embedded devices and smartphones
- We will analyze quantization, pruning, and knowledge distillation for model compression