Why Accuracy Is Not Enough
Evaluating machine learning models is harder than it looks. The most common beginner mistake is relying exclusively on accuracy (the percentage of correct predictions). On an imbalanced dataset where one class makes up 95% of the samples, a model that always predicts the majority class scores 95% accuracy while being completely useless. Properly evaluating a model requires more informative metrics.
In this article we will explore the full arsenal of metrics available for classification and regression, understand when to use each one, and build a robust evaluation framework with cross-validation.
What You Will Learn in This Article
- Confusion Matrix: TP, TN, FP, FN
- Precision, Recall, and F1-Score
- ROC Curve and AUC
- Regression metrics: MSE, MAE, R²
- Cross-validation: robust evaluation
- How to choose the right metric for your problem
The Confusion Matrix
The confusion matrix is the foundation of all classification metrics. For a binary problem, it organizes predictions into four categories: True Positive (TP) — correctly predicted as positive; True Negative (TN) — correctly predicted as negative; False Positive (FP) — incorrectly predicted as positive (Type I error); False Negative (FN) — incorrectly predicted as negative (Type II error).
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, accuracy_score
)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Dataset
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=10000, random_state=42))
])
pipeline.fit(X_train, y_train)

# Predictions and probabilities
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print("Confusion Matrix:")
print(f"  TN={tn}  FP={fp}")
print(f"  FN={fn}  TP={tp}")

# Metrics
print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.3f}")

# Complete report
print(f"\n{classification_report(y_test, y_pred, target_names=data.target_names)}")
Precision, Recall, and F1-Score
Precision = TP / (TP + FP): of all samples predicted as positive, how many actually are? High precision means few false positives. Crucial when false positives are costly (e.g., credit approval: you do not want to approve an insolvent customer).
Recall (Sensitivity) = TP / (TP + FN): of all actually positive samples, how many were found? High recall means few false negatives. Crucial when false negatives are dangerous (e.g., tumor diagnosis: you do not want to miss a positive case).
F1-Score = 2 * (Precision * Recall) / (Precision + Recall): the harmonic mean of precision and recall. Useful when a balance between the two is needed. The harmonic mean penalizes extreme values: if one metric is very low, the F1 will be low.
Precision-Recall Trade-off: Raising the classification threshold (above 0.5) increases precision but decreases recall. Lowering it does the opposite. There is no universally better threshold: it depends on the relative cost of false positives vs false negatives in your specific domain.
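The trade-off can be seen directly by sweeping a threshold over predicted probabilities. A minimal sketch with synthetic labels and scores (the values and thresholds below are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic ground truth and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_proba = np.array([0.1, 0.4, 0.35, 0.6, 0.55, 0.7, 0.8, 0.9, 0.45, 0.2])

# As the threshold rises, precision goes up and recall goes down
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this data, the threshold of 0.3 captures every positive (recall 1.0) at the cost of several false positives, while 0.7 yields perfect precision but misses two positives: exactly the trade-off described above.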
ROC Curve and AUC
The ROC curve (Receiver Operating Characteristic) visualizes the trade-off between True Positive Rate (recall) and False Positive Rate across different classification thresholds. AUC (Area Under the Curve) summarizes the ROC in a single number: 1.0 is a perfect model, 0.5 is equivalent to chance (random classification). AUC measures the model's ability to separate classes regardless of the chosen threshold.
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# --- CROSS-VALIDATION with multiple metrics ---
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

print("Cross-Validation Results:")
for metric in metrics:
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring=metric)
    print(f"  {metric:<12s}: {scores.mean():.3f} (+/- {scores.std():.3f})")

# --- THRESHOLD TUNING ---
pipeline.fit(X_train, y_train)
y_proba = pipeline.predict_proba(X_test)[:, 1]

# Precision and recall at every candidate threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Find the threshold that maximizes F1 (the small epsilon avoids division by zero)
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx] if best_idx < len(thresholds) else 0.5

print(f"\nBest threshold for F1: {best_threshold:.3f}")
print(f"  Precision: {precisions[best_idx]:.3f}")
print(f"  Recall:    {recalls[best_idx]:.3f}")
print(f"  F1:        {f1_scores[best_idx]:.3f}")
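It is worth seeing that AUC really is the area under the ROC curve: `roc_curve` returns the (FPR, TPR) points, and integrating TPR over FPR reproduces `roc_auc_score`. A self-contained sketch with synthetic scores (labels and scores are arbitrary illustration values):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.45, 0.6, 0.2, 0.7])

# ROC curve points: one (FPR, TPR) pair per distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Integrate TPR over FPR; matches roc_auc_score exactly
roc_auc = auc(fpr, tpr)
print(f"AUC from curve points: {roc_auc:.3f}")
print(f"roc_auc_score:         {roc_auc_score(y_true, y_score):.3f}")
```

AUC also has a probabilistic reading: it is the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, which is why it is threshold-independent.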
Regression Metrics
For regression, metrics measure the distance between predicted and actual values. MSE (Mean Squared Error) penalizes large errors quadratically. MAE (Mean Absolute Error) treats all errors linearly and is therefore more robust to outliers. RMSE (Root Mean Squared Error) is expressed in the same units as the target, which makes it easier to interpret. R² indicates the proportion of variance explained by the model: 1.0 is perfect, 0 is equivalent to always predicting the mean, and negative values indicate a model worse than that baseline.
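A minimal sketch of all four metrics on a synthetic linear problem (the slope, intercept, and noise scale below are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic data: y = 3x + 5 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)  # same units as the target
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R²:   {r2:.3f}")
```

Note that RMSE is always at least as large as MAE; a large gap between the two signals that a few big errors dominate, which is exactly when the MSE-vs-MAE choice matters.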
How to Choose the Right Metric
Metric choice depends on the domain and error costs. For medical diagnosis, recall is critical (you do not want to miss a case). For spam filtering, precision is important (you do not want to lose important emails). For imbalanced datasets, F1-Score and AUC are preferable to accuracy. For regression with outliers, MAE is better than MSE. Always discuss with domain stakeholders to understand which error type is most costly.
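While deciding, a practical pattern is to compute several candidate metrics in a single cross-validation pass with `cross_validate`, instead of rerunning `cross_val_score` once per metric as above. A sketch on the same breast-cancer data (model and fold settings mirror the earlier examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=10000, random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Several scorers evaluated on the same folds, fitting each model only once
scoring = ['precision', 'recall', 'f1', 'roc_auc']
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

for name in scoring:
    scores = results[f'test_{name}']
    print(f"{name:<8s}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Seeing precision, recall, F1, and AUC side by side on identical folds makes the stakeholder conversation concrete: you can show exactly what trading one metric for another would cost.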
Key Takeaways
- Accuracy alone is insufficient, especially with imbalanced datasets
- Precision measures correctness of predicted positives; Recall measures completeness
- F1-Score balances Precision and Recall with the harmonic mean
- ROC-AUC measures separation capability regardless of threshold
- Cross-validation with StratifiedKFold is the gold standard for evaluation
- Metric choice depends on the relative costs of different error types in your domain