Reusing Knowledge: The Idea Behind Transfer Learning
Transfer learning is one of the most impactful techniques in modern machine learning. The idea is simple but powerful: instead of training a model from scratch for every new problem, you start from a model already trained on a large dataset (the source domain) and adapt it to your specific problem (the target domain). This works because features learned in early layers (edges, textures, and shapes for images; word and sentence structure for text) are often general enough to transfer across tasks.
Transfer learning has democratized deep learning: you no longer need a GPU cluster and millions of images to train an image classifier. You can take a pre-trained model like ResNet (trained on 1.2 million ImageNet images) and adapt it to your problem with a few hundred images and minutes of training.
What You Will Learn in This Article
- When and why transfer learning works
- Feature extraction vs fine-tuning
- Layer freezing strategies
- Pre-trained models for images and text
- Data augmentation for small datasets
- Practical implementation with Python
Feature Extraction: Freezing the Backbone
The simplest transfer learning strategy is feature extraction: use the pre-trained model as a fixed feature extractor, remove the last layer (the classification head), and add a new classifier trained on the target domain. The backbone weights remain frozen: they are not updated during training. This is ideal when the target dataset is small and the source domain is similar to the target.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
import numpy as np
# Transfer learning simulation with scikit-learn
# The concept: use features learned by a pre-trained model
# as input for a lightweight classifier
# Dataset: digit recognition (8x8 pixels)
digits = load_digits()
X, y = digits.data, digits.target
# Scenario 1: Direct training (no transfer)
direct_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=5000, random_state=42))
])
scores_direct = cross_val_score(direct_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Direct (all features): {scores_direct.mean():.3f}")
# Scenario 2: Feature extraction simulation
# Pre-trained model has already extracted high-level features
# We use PCA as a proxy for "learned features"
from sklearn.decomposition import PCA
# Pre-trained feature extractor (freeze weights)
feature_extractor = PCA(n_components=20, random_state=42)
X_features = feature_extractor.fit_transform(X)
# New classifier on extracted features
transfer_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=5000, random_state=42))
])
scores_transfer = cross_val_score(transfer_pipeline, X_features, y, cv=5, scoring='accuracy')
print(f"Feature extraction (20 comp): {scores_transfer.mean():.3f}")
# Scenario 3: With reduced dataset (simulates scarce target domain data)
np.random.seed(42)  # reproducible subsample
small_idx = np.random.choice(len(X), size=200, replace=False)
X_small, y_small = X[small_idx], y[small_idx]
# Frozen extractor: transform only, never refit on the small set
X_small_features = feature_extractor.transform(X_small)
scores_small_direct = cross_val_score(direct_pipeline, X_small, y_small, cv=5)
scores_small_transfer = cross_val_score(transfer_pipeline, X_small_features, y_small, cv=5)
print(f"\nWith only 200 samples:")
print(f" Direct: {scores_small_direct.mean():.3f}")
print(f" Feature extraction: {scores_small_transfer.mean():.3f}")
Fine-Tuning: Adapting the Model
Fine-tuning goes beyond feature extraction: after replacing the classification head, some backbone layers are unfrozen and retrained with a very low learning rate. This allows the model to also adapt intermediate features to the target domain. The general rule is: the more different the target domain is from the source, the more layers should be unfrozen.
Common layer freezing strategies:
- Unfreeze only the last block of layers (conservative)
- Gradual unfreezing: progressively unfreeze layers from the end toward the beginning
- Discriminative learning rates: use a different learning rate per layer, with layers closer to the input getting lower rates
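These strategies are easiest to apply in a deep learning framework, but the core mechanic, scaling each layer's gradient update by its own learning rate, can be sketched with plain NumPy. The toy two-layer model below is illustrative (all names and values are assumptions, not from a real framework): the pre-trained "backbone" layer gets a tiny learning rate while the new "head" adapts quickly; a rate of 0.0 would freeze the backbone entirely.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy two-layer linear model: y = W2 @ (W1 @ x)
W1 = rng.normal(size=(4, 8))  # "backbone" layer (pre-trained)
W2 = rng.normal(size=(1, 4))  # new classification "head"

# Discriminative learning rates: the backbone takes much smaller
# steps than the head; 0.0 would freeze it entirely
lr = {"W1": 1e-4, "W2": 1e-2}

x = rng.normal(size=(8, 1))
y_true = np.array([[1.0]])

for step in range(300):
    h = W1 @ x                         # backbone features
    y_pred = W2 @ h                    # head prediction
    err = y_pred - y_true              # gradient of 0.5 * squared error
    grad_W2 = err @ h.T
    grad_W1 = (W2.T @ err) @ x.T
    W2 -= lr["W2"] * grad_W2           # large step: head adapts quickly
    W1 -= lr["W1"] * grad_W1           # tiny step: backbone barely moves

print(f"final absolute error: {abs(y_pred - y_true).item():.4f}")
```

The same structure generalizes to gradual unfreezing: start with the backbone rate at 0.0 and raise it (layer by layer, from the end) as training progresses.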
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()
X, y = digits.data, digits.target
# Fine-tuning simulation:
# Step 1: Pre-training on a subset (source domain)
# Step 2: Fine-tuning on another subset (target domain)
# Source domain: digits 0-4
source_mask = y <= 4
X_source, y_source = X[source_mask], y[source_mask]
# Target domain: digits 5-9 (with few data)
target_mask = y > 4
X_target, y_target = X[target_mask], y[target_mask] - 5
# Only 50 samples in target domain
np.random.seed(42)
small_idx = np.random.choice(len(X_target), size=50, replace=False)
X_target_small = X_target[small_idx]
y_target_small = y_target[small_idx]
# Model 1: Training from scratch on small target
from_scratch = GradientBoostingClassifier(
    n_estimators=100, random_state=42
)
scores_scratch = cross_val_score(
    from_scratch, X_target_small, y_target_small, cv=5
)
# Model 2: Pre-trained on source, fine-tuned on target
pretrained = GradientBoostingClassifier(
    n_estimators=100, random_state=42,
    warm_start=True  # allows continuing training
)
# Pre-training on source
pretrained.fit(X_source, y_source)
# Fine-tuning: add more estimators trained on the target data
pretrained.n_estimators = 150  # add 50 estimators
pretrained.fit(X_target_small, y_target_small)
# Evaluate on target samples not used for fine-tuning
holdout_idx = np.setdiff1d(np.arange(len(X_target)), small_idx)
acc_finetuned = pretrained.score(X_target[holdout_idx], y_target[holdout_idx])
# Data augmentation: add Gaussian noise to expand the dataset
noise_factor = 0.3
X_augmented = np.vstack([
    X_target_small,
    X_target_small + np.random.normal(0, noise_factor, X_target_small.shape),
    X_target_small + np.random.normal(0, noise_factor, X_target_small.shape)
])
y_augmented = np.concatenate([y_target_small] * 3)
aug_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
# Note: noisy copies of a sample can land in both train and test folds,
# so this CV estimate is optimistic
scores_aug = cross_val_score(aug_model, X_augmented, y_augmented, cv=5)
print("Results with 50 target samples:")
print(f"  From scratch: {scores_scratch.mean():.3f}")
print(f"  Fine-tuned (held-out acc): {acc_finetuned:.3f}")
print(f"  Data augmentation: {scores_aug.mean():.3f}")
Pre-Trained Models in the Ecosystem
For images, the most widely used models are trained on ImageNet (1.2M images, 1,000 classes): ResNet (residual connections), EfficientNet (optimized for accuracy per parameter), VGG (simple but effective). For text, pre-trained transformer models dominate: BERT (Google, bidirectional), GPT (OpenAI, generative), RoBERTa (an optimized BERT variant).
Hugging Face is the reference platform for pre-trained models: thousands of models for text classification, translation, question answering, text generation and more, all accessible with a few lines of code through the transformers library.
When Transfer Learning Makes Sense
Transfer learning is the right choice when the target dataset is small (fewer than a few thousand samples), the source domain is similar to the target (e.g. natural images as a starting point for classifying medical images), or when you want to save time and computational resources. It is not recommended when source and target domains are very different (text vs images), when the target dataset is already large, or when source features are not useful for the target.
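These criteria can be summarized as a small decision helper. The function below is purely illustrative: the thresholds are arbitrary assumptions, not established rules, and real decisions should also weigh domain knowledge and compute budget.

```python
def recommend_strategy(n_target_samples: int, domains_similar: bool) -> str:
    """Toy heuristic for choosing a transfer learning strategy.

    The thresholds below are illustrative assumptions, not established rules.
    """
    if not domains_similar:
        # Source features unlikely to help (e.g. a text model for images)
        return "train from scratch"
    if n_target_samples < 1000:
        # Small dataset: keep the backbone frozen to avoid overfitting
        return "feature extraction"
    if n_target_samples < 100_000:
        # Enough data to adapt some layers with a low learning rate
        return "fine-tuning"
    # Plenty of data: pre-training still helps, but full training is viable
    return "fine-tuning or train from scratch"

print(recommend_strategy(500, domains_similar=True))     # feature extraction
print(recommend_strategy(50_000, domains_similar=True))  # fine-tuning
```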
Training cost: Pre-training BERT from scratch requires 4 days on 16 TPUs (estimated cost $10K-$50K). Fine-tuning BERT on a specific task requires a few minutes on a single GPU. This cost difference explains why transfer learning is the standard choice in modern deep learning.
Key Takeaways
- Transfer learning reuses models pre-trained on large datasets for new problems
- Feature extraction: freeze the backbone, train only the new classification head
- Fine-tuning: progressively unfreeze layers with a low learning rate
- Works best when source and target domains are similar and target dataset is small
- ResNet/EfficientNet for images, BERT/GPT for text are the standard models
- Data augmentation complements transfer learning when data is scarce