Introduction: Compressing Data Without Losing Information
Real-world datasets often have hundreds or thousands of features, many of them redundant or correlated with one another. Dimensionality reduction compresses the data into a lower-dimensional space while retaining most of the useful information. The most widely used algorithm is PCA (Principal Component Analysis), which relies entirely on the eigenvalues and eigenvectors of the covariance matrix.
What You Will Learn
- The curse of dimensionality and why reduction matters
- Covariance matrix: understanding correlations
- PCA: finding the directions of maximum variance
- Explained variance and choosing the number of components
- t-SNE and UMAP for non-linear visualization
- Full implementation in NumPy and scikit-learn
The Curse of Dimensionality
In high-dimensional spaces, data becomes sparse and pairwise distances concentrate: all points end up approximately equidistant. This makes distance metrics less informative and encourages overfitting. PCA mitigates this by projecting the data onto its most informative directions.
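The distance-concentration effect is easy to see numerically. The following sketch (illustrative, not from the article; sample sizes and seed are assumptions) measures how the gap between the nearest and farthest neighbor shrinks relative to the distances themselves as the dimension grows:

```python
import numpy as np

# Illustrative sketch: as dimension d grows, the relative spread between
# the nearest and farthest point shrinks -- distances "concentrate".
rng = np.random.default_rng(0)
spreads = {}
for d in [2, 10, 100, 1000]:
    X = rng.standard_normal((500, d))
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from point 0
    spreads[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative spread (max-min)/min = {spreads[d]:.3f}")
```

The relative spread drops sharply with d, which is why nearest-neighbor distinctions become unreliable in high dimensions.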
Covariance Matrix
The covariance matrix \\mathbf{C} captures the correlations between all pairs of features. For a centered dataset (zero mean) \\mathbf{X} \\in \\mathbb{R}^{n \\times d}:
\\mathbf{C} = \\frac{1}{n-1} \\mathbf{X}^\\top \\mathbf{X}
Each element C_{ij} is the covariance between features i and j:
C_{ij} = \\frac{1}{n-1} \\sum_{m=1}^{n} x_{mi} \\, x_{mj}
The diagonal contains each feature's variance; off-diagonal elements contain covariances. If C_{ij} > 0, features i and j are positively correlated; if C_{ij} < 0, negatively correlated; if C_{ij} = 0, uncorrelated.
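These properties can be checked on a tiny synthetic example (the data and seed below are assumptions for illustration): two strongly correlated features and one independent feature.

```python
import numpy as np

# Minimal sketch: two correlated features plus an independent one.
rng = np.random.default_rng(1)
a = rng.standard_normal(1000)
b = 2 * a + 0.1 * rng.standard_normal(1000)   # strongly correlated with a
c = rng.standard_normal(1000)                  # independent of both
X = np.column_stack([a, b, c])
C = np.cov(X, rowvar=False)   # rowvar=False: columns are features
print(np.round(C, 2))
# Diagonal: variances of a, b, c. C[0, 1] is large (a and b correlated);
# C[0, 2] and C[1, 2] are near zero (c is uncorrelated with a and b).
```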
PCA: Mathematical Derivation
PCA seeks the directions along which the data has maximum variance. The first principal component \\mathbf{w}_1 is the unit vector that maximizes the variance of the projection:
\\mathbf{w}_1 = \\arg\\max_{\\|\\mathbf{w}\\| = 1} \\mathbf{w}^\\top \\mathbf{C} \\mathbf{w}
Using Lagrange multipliers, it can be shown that the solution is the eigenvector corresponding to the largest eigenvalue of \\mathbf{C}:
\\mathbf{C} \\mathbf{w}_1 = \\lambda_1 \\mathbf{w}_1
where \\lambda_1 \\geq \\lambda_2 \\geq \\cdots \\geq \\lambda_d \\geq 0 are the eigenvalues in decreasing order. The eigenvalue \\lambda_i is exactly the variance captured by the i-th principal component.
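The claim that \\lambda_i equals the variance of the i-th projection can be verified numerically. A minimal sketch (toy data and seed are assumptions):

```python
import numpy as np

# Sketch: the variance of the projection onto the top eigenvector
# equals the largest eigenvalue of the covariance matrix.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: ascending order, symmetric C
w1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
proj_var = np.var(Xc @ w1, ddof=1)     # ddof=1 matches np.cov's normalization
print(f"lambda_1 = {eigvals[-1]:.4f}, Var(X w1) = {proj_var:.4f}")
```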
Projection and Reconstruction
To reduce to k dimensions, we project onto the first k eigenvectors, stacked as \\mathbf{W}_k = [\\mathbf{w}_1, \\ldots, \\mathbf{w}_k] \\in \\mathbb{R}^{d \\times k}:
\\mathbf{Z} = \\mathbf{X} \\mathbf{W}_k
The approximate reconstruction is:
\\hat{\\mathbf{X}} = \\mathbf{Z} \\mathbf{W}_k^\\top = \\mathbf{X} \\mathbf{W}_k \\mathbf{W}_k^\\top
Explained Variance
The fraction of variance explained by the first k components is:
\\text{EVR}(k) = \\frac{\\sum_{i=1}^{k} \\lambda_i}{\\sum_{i=1}^{d} \\lambda_i}
In practice, k is chosen to retain 95% or 99% of total variance.
import numpy as np
# Synthetic dataset: 200 samples, 5 features (correlated)
np.random.seed(42)
n, d = 200, 5
X = np.random.randn(n, 2) @ np.array([[2, 1, 0.5, 0.3, 0.1],
                                      [0.5, 1.5, 1, 0.2, 0.8]])
X += np.random.randn(n, d) * 0.3 # Noise
# PCA from scratch
# 1. Center the data
X_centered = X - X.mean(axis=0)
# 2. Covariance matrix
C = np.cov(X_centered, rowvar=False)
print(f"Covariance matrix:\n{np.round(C, 3)}\n")
# 3. Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(C)
# Sort in descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"Eigenvalues: {np.round(eigenvalues, 4)}")
# 4. Explained variance
var_explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(var_explained)
for i in range(d):
    print(f"PC{i+1}: {var_explained[i]*100:.1f}% (cumulative: {cumulative[i]*100:.1f}%)")
# 5. Project to 2D
k = 2
W_k = eigenvectors[:, :k]
Z = X_centered @ W_k
print(f"\nOriginal shape: {X.shape} -> Reduced: {Z.shape}")
# 6. Reconstruction error
X_reconstructed = Z @ W_k.T + X.mean(axis=0)
reconstruction_error = np.mean((X - X_reconstructed)**2)
print(f"Reconstruction error (MSE): {reconstruction_error:.6f}")
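The "retain 95% (or 99%) of variance" rule can be turned into a one-liner with np.searchsorted on the cumulative ratio. A self-contained sketch (the eigenvalues below are hypothetical, sorted in descending order):

```python
import numpy as np

# Sketch: pick the smallest k whose cumulative explained variance
# reaches a target fraction. Eigenvalues here are hypothetical.
eigenvalues = np.array([4.1, 2.3, 0.9, 0.4, 0.2, 0.1])
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
for target in (0.95, 0.99):
    # searchsorted finds the first index where cumulative >= target
    k = int(np.searchsorted(cumulative, target) + 1)
    print(f"target {target:.0%}: keep k = {k} components "
          f"({cumulative[k-1]:.1%} retained)")
```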
PCA with Scikit-Learn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Standardization (important! PCA is scale-sensitive)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Automatic PCA
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Components selected: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Shape: {X.shape} -> {X_pca.shape}")
Beyond PCA: t-SNE and UMAP
PCA is limited to linear transformations. For non-linear structures in data, methods like t-SNE and UMAP are used.
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local structure: points that are close in the original space remain close in the 2D representation. It minimizes the Kullback-Leibler divergence between the pairwise-similarity distributions P (original space) and Q (reduced space):
\\mathrm{KL}(P \\,\\|\\, Q) = \\sum_{i \\neq j} p_{ij} \\log \\frac{p_{ij}}{q_{ij}}
UMAP
UMAP (Uniform Manifold Approximation and Projection) is faster than t-SNE and better preserves global structure. It is based on algebraic topology and fuzzy graph theory.
When to use which: PCA for preprocessing (reducing dimensions before a classifier, denoising); t-SNE/UMAP for 2D/3D visualization (exploring clusters and outliers). PCA is invertible and interpretable; t-SNE/UMAP are not.
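As a concrete sketch of the visualization use case, here is t-SNE applied to a subsample of the digits dataset via scikit-learn (the subsample size and t-SNE parameters are illustrative assumptions, not tuned values):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Sketch: embed a subsample of the 64-dimensional digits data in 2D.
digits = load_digits()
X, y = digits.data[:500], digits.target[:500]   # subsample to keep it fast
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_2d = tsne.fit_transform(X)
print(f"{X.shape} -> {X_2d.shape}")             # 64 features -> 2 for plotting
```

Scattering X_2d colored by y typically shows the ten digit classes as separate clusters. UMAP (from the separate umap-learn package) exposes a similar fit_transform interface and is usually faster on large datasets.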
Application: PCA for ML Preprocessing
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
# Digits dataset: 1797 images 8x8 = 64 features
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Without PCA (64 features)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
clf_full = LogisticRegression(max_iter=5000)
clf_full.fit(X_train_s, y_train)
acc_full = clf_full.score(X_test_s, y_test)
# With PCA (keep 95% variance)
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_s)
X_test_pca = pca.transform(X_test_s)
clf_pca = LogisticRegression(max_iter=5000)
clf_pca.fit(X_train_pca, y_train)
acc_pca = clf_pca.score(X_test_pca, y_test)
print(f"Without PCA: {X_train_s.shape[1]} features, Accuracy: {acc_full:.4f}")
print(f"With PCA: {X_train_pca.shape[1]} features, Accuracy: {acc_pca:.4f}")
print(f"Reduction: {(1 - X_train_pca.shape[1]/X_train_s.shape[1])*100:.0f}% of features")
Summary and Connections to ML
Key Takeaways
- PCA: projects onto the first k eigenvectors of the covariance matrix
- Eigenvalues \\lambda_i: variance captured by each component
- Explained variance: choose k to retain 95%+ of total variance
- Standardization: essential before PCA (scale-sensitive)
- t-SNE/UMAP: for non-linear 2D/3D visualization
- PCA for preprocessing: reduces overfitting and speeds up training
In the next article, we will explore loss functions in detail: MSE, cross-entropy, focal loss, hinge loss, and how to choose among them and create custom ones.