Introduction: The Loss Function Drives Everything
The loss function is the criterion a model optimizes during training. It is the "compass" of learning: it defines what it means to be "wrong" and by how much. Choosing the wrong loss can make learning impossible even with the perfect architecture.
What You Will Learn
- Regression losses: MSE, MAE, Huber Loss
- Classification losses: Binary and multiclass Cross-Entropy
- Focal Loss for imbalanced datasets
- Hinge Loss for SVMs and margins
- Contrastive and Triplet Loss for embeddings
- How to create custom loss functions
Regression Losses
Mean Squared Error (MSE)
The most widely used regression loss. It penalizes large errors quadratically:

MSE = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2

The derivative with respect to the prediction \\hat{y}_i:

\\frac{\\partial \\text{MSE}}{\\partial \\hat{y}_i} = \\frac{2}{n}(\\hat{y}_i - y_i)
Pros: differentiable everywhere, gradient proportional to error. Cons: sensitive to outliers (a single anomalous point can dominate the loss).
Mean Absolute Error (MAE)

MAE = \\frac{1}{n}\\sum_{i=1}^{n}|y_i - \\hat{y}_i|

Pros: robust to outliers (linear penalty). Cons: constant gradient magnitude (does not shrink near the minimum), not differentiable at zero.
Huber Loss: The Compromise
Huber Loss combines the best of MSE and MAE: it behaves like MSE for small errors and like MAE for large ones:

L_\\delta(e) = \\begin{cases} \\frac{1}{2}e^2 & \\text{if } |e| \\leq \\delta \\\\ \\delta\\left(|e| - \\frac{1}{2}\\delta\\right) & \\text{otherwise} \\end{cases}

where e = y - \\hat{y} and \\delta is the threshold between the two regimes.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    error = np.abs(y_true - y_pred)
    quadratic = np.minimum(error, delta)  # capped at delta
    linear = error - quadratic            # excess beyond delta
    return np.mean(0.5 * quadratic**2 + delta * linear)

# Normal data
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.2, 2.8, 4.1, 4.9])

print("Without outliers:")
print(f"  MSE:   {mse(y_true, y_pred):.4f}")
print(f"  MAE:   {mae(y_true, y_pred):.4f}")
print(f"  Huber: {huber_loss(y_true, y_pred):.4f}")

# Add an outlier
y_true_out = np.append(y_true, 6.0)
y_pred_out = np.append(y_pred, 20.0)  # Outlier!

print("\nWith outlier:")
print(f"  MSE:   {mse(y_true_out, y_pred_out):.4f}")
print(f"  MAE:   {mae(y_true_out, y_pred_out):.4f}")
print(f"  Huber: {huber_loss(y_true_out, y_pred_out):.4f}")
Classification Losses
Binary Cross-Entropy (BCE)
For binary classification with sigmoid output \\hat{y} = \\sigma(z) \\in (0, 1):

BCE = -\\frac{1}{n}\\sum_{i=1}^{n}\\left[y_i \\log \\hat{y}_i + (1 - y_i)\\log(1 - \\hat{y}_i)\\right]

The derivative with respect to the logit z (elegantly simple):

\\frac{\\partial L}{\\partial z} = \\hat{y} - y
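As a quick sketch, BCE can be computed in a few lines of NumPy (the epsilon clip is an implementation detail to avoid log(0), not part of the definition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

z = np.array([2.0, -1.0, 0.5, -3.0])       # logits
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = sigmoid(z)

print(f"BCE: {binary_cross_entropy(y_true, y_pred):.4f}")
# Gradient w.r.t. the logits is simply y_pred - y_true
print("Gradient:", y_pred - y_true)
```

Note how the gradient comes out for free once the sigmoid outputs are available, which is why sigmoid and BCE are almost always implemented together.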
Categorical Cross-Entropy
For multiclass classification with softmax output:

L = -\\sum_{k=1}^{K} y_k \\log \\hat{y}_k

With one-hot labels (only one y_c = 1), this reduces to L = -\\log(\\hat{y}_c), where c is the correct class.
Focal Loss: For Imbalanced Datasets
Focal Loss down-weights easy examples so training focuses on the hard ones:

FL(p_t) = -\\alpha_t (1 - p_t)^\\gamma \\log(p_t)

where p_t is the probability of the correct class, \\gamma \\geq 0 is the focusing parameter (typically 2), and \\alpha_t is a class-balancing factor. When the model is already confident (p_t high), the factor (1 - p_t)^\\gamma shrinks the loss contribution toward zero.
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def cross_entropy_loss(y_true, logits):
    """Categorical cross-entropy with softmax."""
    probs = softmax(logits)
    n = y_true.shape[0]
    log_probs = -np.log(probs[np.arange(n), y_true] + 1e-15)
    return np.mean(log_probs)

def focal_loss(y_true, logits, gamma=2.0, alpha=1.0):
    """Focal loss for imbalanced datasets."""
    probs = softmax(logits)
    n = y_true.shape[0]
    pt = probs[np.arange(n), y_true]
    focal_weight = alpha * (1 - pt) ** gamma
    log_probs = -np.log(pt + 1e-15)
    return np.mean(focal_weight * log_probs)

# Example: 3 classes
y_true = np.array([0, 1, 2, 1, 0])
logits = np.array([
    [2.0, 0.5, -1.0],   # Class 0, easy
    [0.1, 3.0, -0.5],   # Class 1, easy
    [-1.0, 0.2, 1.5],   # Class 2, medium
    [0.5, 0.3, 0.2],    # Class 1, hard
    [0.1, 0.1, 0.1],    # Class 0, very hard
])

print(f"Cross-Entropy:        {cross_entropy_loss(y_true, logits):.4f}")
print(f"Focal Loss (gamma=2): {focal_loss(y_true, logits, gamma=2):.4f}")
print(f"Focal Loss (gamma=0): {focal_loss(y_true, logits, gamma=0):.4f}")  # = CE
Losses for Similarity Learning
Hinge Loss
Used in SVMs (Support Vector Machines); it requires a minimum margin between classes:

L = \\max(0, 1 - y \\cdot f(x))

where y \\in \\{-1, +1\\} and f(x) is the raw model score. If the model classifies correctly with a sufficient margin (y \\cdot f(x) \\geq 1), the loss is zero.
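A minimal NumPy sketch of the hinge loss over a batch of raw scores (the example values are illustrative):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true in {-1, +1}, scores are raw model outputs."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
scores = np.array([2.5, -1.2, 0.3, 0.4])  # last two violate the margin

print(f"Hinge loss: {hinge_loss(y_true, scores):.4f}")  # 0.5250
```

The first two examples sit outside the margin and contribute nothing; the last two contribute 0.7 and 1.4, so only margin violators drive the gradient.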
Contrastive Loss
For pairs of embeddings; it pulls similar pairs together and pushes dissimilar ones apart:

L = (1 - y)\\,\\frac{1}{2}d^2 + y\\,\\frac{1}{2}\\max(0, m - d)^2

where d = \\|f(x_1) - f(x_2)\\|, y=0 for similar pairs, y=1 for dissimilar pairs, and m is the margin.
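The formula translates directly to NumPy; a sketch with 2-D embeddings (values chosen for illustration):

```python
import numpy as np

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """y=0 for similar pairs, y=1 for dissimilar pairs."""
    d = np.linalg.norm(emb1 - emb2, axis=1)          # pairwise distances
    similar_term = (1 - y) * 0.5 * d**2              # pull similar pairs together
    dissimilar_term = y * 0.5 * np.maximum(0.0, margin - d)**2  # push apart up to the margin
    return np.mean(similar_term + dissimilar_term)

emb1 = np.array([[0.0, 0.0], [0.0, 0.0]])
emb2 = np.array([[0.1, 0.0], [0.3, 0.0]])
y = np.array([0, 1])  # first pair similar, second dissimilar

print(f"Contrastive loss: {contrastive_loss(emb1, emb2, y):.4f}")  # 0.1250
```

Dissimilar pairs already farther apart than the margin contribute zero, so the loss stops pushing once the separation is "good enough".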
Triplet Loss
Uses triplets (anchor, positive, negative) to learn embeddings:

L = \\max(0, d(a, p) - d(a, n) + m)

where d is the embedding distance. The anchor must be closer to the positive than to the negative by at least the margin m.
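A sketch of the batched triplet loss with Euclidean distance (margin and embeddings are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-negative distance
    return np.mean(np.maximum(0.0, d_pos - d_neg + margin))

# Two triplets: the first already satisfies the margin, the second violates it
anchor   = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [0.1, 0.0]])
negative = np.array([[1.0, 0.0], [0.2, 0.0]])

print(f"Triplet loss: {triplet_loss(anchor, positive, negative):.4f}")  # 0.0500
```

Only the second triplet contributes (0.1 - 0.2 + 0.2 = 0.1); the first is already separated by more than the margin. In practice, mining such "hard" triplets is what makes triplet training effective.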
Temperature Scaling and Calibration
Temperature scaling is a post-hoc calibration technique that divides the logits by a parameter T > 0 inside the softmax:

\\hat{y}_i = \\frac{e^{z_i / T}}{\\sum_j e^{z_j / T}}

T > 1: "softer" distributions (less confident). T < 1: "sharper" distributions (more confident). T = 1: standard softmax.
Guide to Choosing the Right Loss
Practical rules:
- Regression: MSE (standard), Huber (with outliers), MAE (optimizes the median rather than the mean)
- Binary classification: Binary Cross-Entropy + sigmoid
- Multiclass classification: Cross-Entropy + softmax
- Imbalanced dataset: Focal Loss or Cross-Entropy with class weights
- Similarity/Embedding: Triplet Loss or Contrastive Loss
- Ranking: Hinge Loss or Margin Ranking Loss
Summary
Key Takeaways
- MSE: \\frac{1}{n}\\sum(y - \\hat{y})^2 - standard for regression, sensitive to outliers
- Cross-Entropy: -\\sum y_k \\log \\hat{y}_k - standard for classification
- Focal Loss: adds (1-p_t)^\\gamma to balance classes
- Huber Loss: robust MSE/MAE compromise with threshold \\delta
- Triplet Loss: for learning embeddings with margin
- Temperature: T scales softmax confidence
In the Next Article: we will explore inferential statistics for data scientists. Confidence intervals, t-tests, p-values, A/B testing, and how to avoid the most common statistical pitfalls.