The Fundamental Problem of Machine Learning
The bias-variance tradeoff is the most important concept for understanding why an ML model works or fails. A model's expected error decomposes into bias (how much the model's assumptions deviate from reality), variance (how sensitive the model is to fluctuations in the training data), and irreducible noise. The goal is finding the balance point that minimizes the reducible part of the error.
Overfitting occurs when the model memorizes training data including noise, achieving excellent performance on the training set but poor performance on new data. Underfitting occurs when the model is too simple to capture the real patterns in the data. Recognizing and solving these problems is one of the most valuable skills in ML.
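A minimal sketch of both failure modes (not part of the original code): fitting polynomials of increasing degree to noisy samples of a sine curve. Degree 1 underfits (high error on both splits), while degree 15 drives the training error down much further than the test error, the signature of overfitting. The data and degrees here are illustrative choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy samples of a sine curve
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(X_tr))
    mse_te = mean_squared_error(y_te, model.predict(X_te))
    results[degree] = (mse_tr, mse_te)
    print(f"degree={degree:<3d} train_mse={mse_tr:.3f} test_mse={mse_te:.3f} "
          f"gap={mse_te - mse_tr:.3f}")
```

Watching the train/test gap grow with model complexity is the same diagnostic the learning and validation curves below automate.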
What You Will Learn in This Article
- Bias-Variance tradeoff and how to diagnose it
- Signs of overfitting and underfitting
- Learning curves for diagnosis
- Cross-validation strategies
- L1 (Lasso) and L2 (Ridge) regularization
- Early stopping and data augmentation
Diagnosing Overfitting and Underfitting
The most direct way to diagnose overfitting and underfitting is comparing performance on the training set and test set. If the model has high performance on training but low on test, it is overfitting. If it has low performance on both, it is underfitting. Learning curves visualize how performance changes with varying data amounts or model complexity.
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np

# Dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Learning curve: performance vs training set size
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

print("Learning Curve (Decision Tree without limits):")
print(f"{'Train Size':<12s} {'Train Acc':<12s} {'Val Acc':<12s} {'Gap':<8s} Status")
for size, train, val in zip(
    train_sizes,
    train_scores.mean(axis=1),
    val_scores.mean(axis=1)
):
    gap = train - val
    status = "OVERFIT" if gap > 0.05 else "OK"
    print(f"{size:<12d} {train:<12.3f} {val:<12.3f} {gap:<8.3f} {status}")

# Validation curve: performance vs model complexity
param_range = range(1, 20)
train_scores_vc, val_scores_vc = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

print("\nValidation Curve (max_depth):")
best_depth = 1
best_val = 0
for depth, train, val in zip(
    param_range,
    train_scores_vc.mean(axis=1),
    val_scores_vc.mean(axis=1)
):
    if val > best_val:
        best_val = val
        best_depth = depth
    print(f"  depth={depth:<3d} train={train:.3f} val={val:.3f}")

print(f"\nBest max_depth: {best_depth} (val accuracy: {best_val:.3f})")
Cross-Validation: Robust Evaluation
Cross-validation is the standard technique for reliably estimating generalization performance. K-Fold CV divides the dataset into K equal parts: at each iteration, one part is used as test and the remaining K-1 as training. It repeats K times and computes the average performance. Stratified K-Fold maintains class proportions in each fold, essential for imbalanced datasets.
from sklearn.model_selection import (
    KFold, StratifiedKFold, RepeatedStratifiedKFold,
    cross_val_score, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# CV strategies
strategies = {
    '5-Fold': KFold(n_splits=5, shuffle=True, random_state=42),
    'Stratified 5-Fold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    'Repeated Strat 5x3': RepeatedStratifiedKFold(
        n_splits=5, n_repeats=3, random_state=42
    )
}

for name, cv in strategies.items():
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
    print(f"{name:<25s}: {scores.mean():.4f} (+/- {scores.std():.4f})")

# cross_validate for multiple metrics
results = cross_validate(
    pipeline, X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

print("\ncross_validate details:")
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    train = results[f'train_{metric}'].mean()
    test = results[f'test_{metric}'].mean()
    gap = train - test
    print(f"  {metric:<12s}: train={train:.3f} test={test:.3f} gap={gap:.3f}")
Regularization: L1 (Lasso) and L2 (Ridge)
Regularization adds a penalty term to the cost function to discourage overly complex models. L2 (Ridge) adds the sum of squared weights: it shrinks all weights toward zero but never zeroes them out. L1 (Lasso) adds the sum of absolute weight values: it can completely zero out some weights, implicitly performing feature selection. Elastic Net combines L1 and L2, controlling the mix with the l1_ratio parameter.
The parameter alpha controls regularization strength: high alpha penalizes more (simpler model, risk of underfitting), low alpha penalizes less (more complex model, risk of overfitting). LogisticRegression uses the inverse parameterization C = 1/alpha, so a small C means strong regularization and a large C means almost none.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

# Regularization comparison for classification
regularizations = {
    'No Reg (C=1e6)': LogisticRegression(C=1e6, max_iter=10000, random_state=42),
    'L2 Weak (C=10)': LogisticRegression(C=10, penalty='l2', max_iter=10000, random_state=42),
    'L2 Strong (C=0.01)': LogisticRegression(C=0.01, penalty='l2', max_iter=10000, random_state=42),
    'L1 (C=1)': LogisticRegression(C=1, penalty='l1', solver='saga', max_iter=10000, random_state=42),
    'ElasticNet': LogisticRegression(C=1, penalty='elasticnet', solver='saga',
                                     l1_ratio=0.5, max_iter=10000, random_state=42)
}

print("Regularization Comparison:")
for name, model in regularizations.items():
    pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    # Count non-zero coefficients (after fit)
    pipeline.fit(X, y)
    n_nonzero = np.sum(np.abs(pipeline.named_steps['clf'].coef_) > 1e-5)
    print(f"  {name:<22s}: acc={scores.mean():.3f} active_features={n_nonzero}/{X.shape[1]}")
Early Stopping and Data Augmentation
Early stopping is a regularization technique for iteratively trained models (gradient boosting, neural networks): it monitors validation set performance at each epoch and stops training when performance stops improving. It prevents overfitting without manually choosing the number of iterations.
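A minimal sketch of early stopping with scikit-learn's GradientBoostingClassifier (an illustrative choice, not the only option): the n_iter_no_change and validation_fraction parameters make the estimator hold out part of the training data internally and stop adding trees once the validation score plateaus.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42
)

# Without early stopping: always fits all 500 trees
gb_full = GradientBoostingClassifier(n_estimators=500, random_state=42)
gb_full.fit(X_tr, y_tr)

# With early stopping: holds out 10% of the training data and stops once
# the validation score has not improved by tol for 10 consecutive iterations
gb_early = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.1,
    n_iter_no_change=10,
    tol=1e-4,
    random_state=42
)
gb_early.fit(X_tr, y_tr)

# n_estimators_ reports how many trees were actually fitted
print(f"Full model:     {gb_full.n_estimators_} trees, "
      f"test acc={gb_full.score(X_te, y_te):.3f}")
print(f"Early stopping: {gb_early.n_estimators_} trees, "
      f"test acc={gb_early.score(X_te, y_te):.3f}")
```

The early-stopped model typically fits far fewer trees at comparable test accuracy, trading a small internal validation split for automatic selection of the iteration count.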
Data augmentation is the most effective strategy against overfitting when data is scarce: generating new training samples through label-preserving transformations. For images: rotations, flips, crops, color variations. For text: synonyms, back-translation. For tabular data: SMOTE for imbalanced data or adding Gaussian noise.
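For tabular data, Gaussian-noise augmentation can be sketched in a few lines. This is an illustrative example, not from the original code: the helper augment_gaussian, the 40-sample training split, and the noise scale of 5% of each feature's standard deviation are all assumptions chosen to simulate a scarce-data regime.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
# Simulate scarce data: keep only 40 training samples
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, train_size=40, stratify=data.target, random_state=42
)

def augment_gaussian(X, y, n_copies=5, scale=0.05, seed=42):
    """Append jittered copies of each sample; noise is scaled per-feature
    so that wide-ranging features get proportionally larger perturbations."""
    rng = np.random.RandomState(seed)
    stds = X.std(axis=0)
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        X_aug.append(X + rng.normal(0, scale * stds, X.shape))
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)

X_big, y_big = augment_gaussian(X_tr, y_tr)
print(f"Training set: {X_tr.shape[0]} -> {X_big.shape[0]} samples")

model = Pipeline([('scaler', StandardScaler()),
                  ('clf', LogisticRegression(max_iter=10000, random_state=42))])
model.fit(X_big, y_big)
print(f"Test accuracy with augmented data: {model.score(X_te, y_te):.3f}")
```

The key design constraint is that the transformation must preserve labels: the noise scale has to stay small relative to class separation, which is why it is tied to each feature's standard deviation.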
Rule of thumb: If the gap between training accuracy and validation accuracy exceeds 5%, the model is probably overfitting. If validation accuracy is below 70% for a not-too-difficult problem, the model is probably underfitting. Learning curves are the most informative diagnostic tool.
Key Takeaways
- High bias = underfitting (too simple); High variance = overfitting (too complex)
- Learning curves visualize the gap between training and validation performance
- Cross-validation (Stratified K-Fold) is the gold standard for evaluation
- L2 (Ridge) shrinks all weights; L1 (Lasso) zeros out some weights (implicit feature selection)
- Early stopping halts training when validation score stops improving
- More data = less overfitting: data augmentation helps when the dataset is small