Introduction: Why Linear Algebra Is the Language of Machine Learning
Every time a machine learning model processes an image, classifies a text, or generates a prediction, under the hood it is performing linear algebra operations. Input data is represented as vectors, model weights as matrices, and the entire inference process boils down to a series of matrix multiplications and linear transformations.
In this article, we will build the mathematical foundations from basic concepts all the way to eigenvalues, SVD, and decompositions that underlie algorithms like PCA, recommendation systems, and model compression. Every formula will be accompanied by an intuitive explanation and a NumPy implementation.
What You Will Learn
- Vectors, norms, and dot product: the geometry of data
- Matrices: multiplication, transpose, inverse, and their meaning
- Determinant and rank: what they tell us about transformations
- Eigenvalues and eigenvectors: the invariant directions
- Singular Value Decomposition (SVD): ML's most powerful tool
- Practical NumPy implementations
Vectors: The Building Blocks
A vector is an ordered list of numbers. In ML, a vector represents a single data point: the features of an image, the pixels of a frame, the encoded words of a sentence. A vector in \\mathbb{R}^n has n components.
For example, a 3-dimensional vector: \\mathbf{v} = (2, -1, 4) \\in \\mathbb{R}^3.
Vector Norms: Measuring Magnitude
The norm measures the "length" of a vector. The two most commonly used norms in ML are:
L2 Norm (Euclidean) - the geometric distance from the origin: \\|\\mathbf{x}\\|_2 = \\sqrt{\\sum_{i=1}^{n} x_i^2}
L1 Norm (Manhattan) - the sum of absolute values, useful for promoting sparsity: \\|\\mathbf{x}\\|_1 = \\sum_{i=1}^{n} |x_i|
In ML, the L2 norm is used in Ridge regularization to penalize overly large weights, while the L1 norm is used in Lasso regularization to obtain sparse weights (many set exactly to zero), which is useful for feature selection.
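As a quick illustration, the two penalty terms can be computed directly. The weight vector and the regularization strength `lam` here are hypothetical values chosen just for the example:

```python
import numpy as np

# Hypothetical weight vector and regularization strength
w = np.array([0.5, -1.2, 0.0, 3.0])
lam = 0.1

ridge_penalty = lam * np.sum(w ** 2)     # L2 penalty: lam * ||w||_2^2
lasso_penalty = lam * np.sum(np.abs(w))  # L1 penalty: lam * ||w||_1
print(f"Ridge penalty: {ridge_penalty:.3f}")  # 1.069
print(f"Lasso penalty: {lasso_penalty:.3f}")  # 0.470
```

Note how the squared L2 penalty is dominated by the single large weight (3.0), which is why Ridge shrinks large weights aggressively, while the L1 penalty grows linearly and tends to zero out small weights entirely.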
Dot Product: Measuring Similarity
The dot product between two vectors is perhaps the most important operation in ML. It measures how much two vectors "point in the same direction": \\mathbf{a} \\cdot \\mathbf{b} = \\sum_{i=1}^{n} a_i b_i
Geometrically, the dot product is related to the angle \\theta between the vectors: \\mathbf{a} \\cdot \\mathbf{b} = \\|\\mathbf{a}\\| \\|\\mathbf{b}\\| \\cos\\theta
When \\cos\\theta = 1, the vectors point in the same direction (maximum similarity); when \\cos\\theta = 0, they are orthogonal (no linear relationship); when \\cos\\theta = -1, they point in opposite directions. This is exactly the principle behind cosine similarity used in recommendation systems and semantic search.
Key Intuition: In a neural network, each neuron computes a dot product between the input vector \\mathbf{x} and the weight vector \\mathbf{w}, adds a bias b, and applies an activation function: \\sigma(\\mathbf{w} \\cdot \\mathbf{x} + b).
import numpy as np
# Vectors
a = np.array([2, -1, 4])
b = np.array([1, 3, 2])
# Dot product
dot = np.dot(a, b) # 2*1 + (-1)*3 + 4*2 = 7
print(f"Dot product: {dot}")
# Norms
l2_norm = np.linalg.norm(a) # sqrt(4 + 1 + 16) = sqrt(21)
l1_norm = np.linalg.norm(a, ord=1) # 2 + 1 + 4 = 7
print(f"L2 norm: {l2_norm:.4f}")
print(f"L1 norm: {l1_norm}")
# Cosine similarity
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cos_sim:.4f}")
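The neuron computation described above can be sketched in a few lines. The weights, bias, and the choice of a sigmoid activation here are hypothetical, picked only to make the formula \\sigma(\\mathbf{w} \\cdot \\mathbf{x} + b) concrete:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

x = np.array([2.0, -1.0, 4.0])   # input vector
w = np.array([0.5, 0.5, -0.25])  # hypothetical weight vector
b = 0.1                          # hypothetical bias

# A single neuron: dot product, plus bias, through the activation
activation = sigmoid(w @ x + b)
print(f"Neuron output: {activation:.4f}")  # ~0.4013
```

Everything a dense layer does is this same computation repeated once per neuron, which is why it can be written as a single matrix multiplication.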
Matrices: Data Transformations
A matrix is a rectangular array of numbers. In ML, matrices represent datasets (rows = samples, columns = features) and neural network weights. A matrix \\mathbf{A} \\in \\mathbb{R}^{m \\times n} has m rows and n columns.
Matrix Multiplication: The Heart of Deep Learning
Multiplication of a matrix \\mathbf{A} \\in \\mathbb{R}^{m \\times n} by a vector \\mathbf{x} \\in \\mathbb{R}^n produces a new vector \\mathbf{y} \\in \\mathbb{R}^m: \\mathbf{y} = \\mathbf{A}\\mathbf{x}, \\quad y_i = \\sum_{j=1}^{n} A_{ij} x_j
This operation is a linear transformation: it takes a vector in the input space and maps it to an output space. A neural network layer does exactly this: \\mathbf{y} = \\sigma(\\mathbf{W}\\mathbf{x} + \\mathbf{b})
where \\mathbf{W} is the weight matrix, \\mathbf{x} the input, \\mathbf{b} the bias, and \\sigma the activation function.
Transpose and Symmetry
The transpose \\mathbf{A}^T swaps rows and columns: (\\mathbf{A}^T)_{ij} = A_{ji}. It is fundamental because:
- The dot product can be written as \\mathbf{a}^T \\mathbf{b}
- The covariance matrix of centered data is \\frac{1}{n}\\mathbf{X}^T\\mathbf{X}
- In backpropagation, gradients are propagated with weight transposes
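The covariance formula in the list above can be checked numerically against NumPy's built-in `np.cov`. The dataset here is random, generated just for the comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))  # 100 samples, 3 features

Xc = X - X.mean(axis=0)            # center each feature (column)
cov = Xc.T @ Xc / len(Xc)          # covariance via the transpose: (1/n) X^T X

# np.cov with rowvar=False treats columns as variables; bias=True divides by n
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))  # True
```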
Determinant: Volume and Invertibility
The determinant of a square matrix measures how the transformation scales volumes. For a 2x2 matrix \\mathbf{A} with rows (a, b) and (c, d): \\det(\\mathbf{A}) = ad - bc
If \\det(\\mathbf{A}) = 0, the matrix is singular (non-invertible): the transformation "squashes" the space, losing information. This signals collinearity in features, which causes issues in linear regression.
Rank: Effective Dimensionality
The rank of a matrix is the number of linearly independent rows (or columns). If a matrix \\mathbf{A} \\in \\mathbb{R}^{m \\times n} has rank r < \\min(m, n), it means the data lives in a subspace of dimension r, not in the full space. This is the principle behind dimensionality reduction.
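A minimal sketch of both ideas at once: when one feature column is an exact multiple of another (perfect collinearity), the determinant vanishes and the rank drops below the matrix size:

```python
import numpy as np

# Second row is exactly 2x the first: the columns are collinear
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(f"Determinant: {np.linalg.det(A)}")   # ~0: singular, not invertible
print(f"Rank: {np.linalg.matrix_rank(A)}")  # 1: the data lives on a line
```

This is exactly the situation that breaks the normal equations in linear regression: \\mathbf{X}^T\\mathbf{X} becomes singular and cannot be inverted.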
import numpy as np
# Weight matrix (neural layer: 3 inputs -> 2 outputs)
W = np.array([[0.5, -0.3, 0.8],
              [0.2, 0.7, -0.4]])
x = np.array([1.0, 2.0, 3.0])
# Forward pass: linear transformation
y = W @ x # or np.dot(W, x)
print(f"Output: {y}")
# Transpose
print(f"W shape: {W.shape}") # (2, 3)
print(f"W^T shape: {W.T.shape}") # (3, 2)
# Determinant (square matrices only)
A = np.array([[3, 1], [2, 4]])
det = np.linalg.det(A)
print(f"Determinant: {det:.2f}") # 10.0
# Rank
rank = np.linalg.matrix_rank(W)
print(f"Rank: {rank}") # 2
Eigenvalues and Eigenvectors: The Special Directions
The eigenvectors of a matrix are the directions that do not change orientation when the transformation is applied. They are only scaled by a factor called the eigenvalue. Formally, for a square matrix \\mathbf{A}: \\mathbf{A}\\mathbf{v} = \\lambda \\mathbf{v}
where \\mathbf{v} is an eigenvector and \\lambda the corresponding eigenvalue.
Geometric intuition: imagine stretching a rubber sheet. Most points move in different directions, but some directions are only elongated or compressed without rotation. Those are the eigenvector directions, and how much they are stretched is given by the eigenvalues.
To find the eigenvalues, we solve the characteristic equation: \\det(\\mathbf{A} - \\lambda \\mathbf{I}) = 0
In ML, the eigenvalues of the covariance matrix tell us how much variance there is along each principal direction. This is the foundation of PCA (Principal Component Analysis).
ML Application: In PCA, the eigenvectors of the covariance matrix are the principal components, and the eigenvalues indicate how much variance each component captures. By selecting the top-k eigenvectors, we reduce dimensionality while retaining most of the information.
import numpy as np
# Symmetric matrix (like a covariance matrix)
A = np.array([[4, 2],
              [2, 3]])
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Verify: A @ v = lambda * v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    lhs = A @ v
    rhs = lam * v
    print(f"A*v = {lhs}, lambda*v = {rhs}, equal: {np.allclose(lhs, rhs)}")
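Building on the eigendecomposition above, here is a minimal PCA sketch on synthetic correlated data (the dataset and noise level are made up for the example): we form the covariance matrix, take its eigenvectors as principal components, and project onto the top one.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic correlated 2D data: second feature = first + small noise
x1 = rng.standard_normal(200)
X = np.column_stack([x1, x1 + 0.3 * rng.standard_normal(200)])

Xc = X - X.mean(axis=0)             # center the data
cov = Xc.T @ Xc / len(Xc)           # covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
order = np.argsort(eigenvalues)[::-1]            # re-sort descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project onto the first principal component (top eigenvector)
X_proj = Xc @ eigenvectors[:, :1]                # shape: (200, 1)
var_ratio = eigenvalues[0] / eigenvalues.sum()
print(f"Variance captured by PC1: {var_ratio:.1%}")
```

Because the two features are strongly correlated, a single component captures almost all the variance: the 2D dataset is, for practical purposes, 1-dimensional.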
Singular Value Decomposition (SVD): The Universal Tool
SVD is the most important and versatile decomposition in linear algebra for ML. Any matrix \\mathbf{A} \\in \\mathbb{R}^{m \\times n} can be decomposed as: \\mathbf{A} = \\mathbf{U} \\boldsymbol{\\Sigma} \\mathbf{V}^T
where:
- \\mathbf{U} \\in \\mathbb{R}^{m \\times m} - orthogonal matrix (output directions)
- \\boldsymbol{\\Sigma} \\in \\mathbb{R}^{m \\times n} - rectangular diagonal matrix with singular values \\sigma_1 \\geq \\sigma_2 \\geq \\cdots \\geq 0 on the diagonal
- \\mathbf{V}^T \\in \\mathbb{R}^{n \\times n} - orthogonal matrix (input directions)
Intuition: SVD decomposes any linear transformation into three steps: a rotation (\\mathbf{V}^T), a scaling along the axes (\\boldsymbol{\\Sigma}), and another rotation (\\mathbf{U}).
Truncated SVD: Intelligent Compression
By keeping only the first k singular values, we obtain the best rank-k approximation (in the Frobenius norm) of the original matrix: \\mathbf{A}_k = \\mathbf{U}_k \\boldsymbol{\\Sigma}_k \\mathbf{V}_k^T = \\sum_{i=1}^{k} \\sigma_i \\mathbf{u}_i \\mathbf{v}_i^T
This is used for: image compression, recommendation systems (matrix factorization), noise reduction, and Latent Semantic Analysis (LSA) for text.
import numpy as np
# Matrix (e.g., user-product ratings)
A = np.array([[5, 4, 0, 0],
              [4, 5, 0, 0],
              [0, 0, 4, 5],
              [0, 0, 5, 4]])
# Full SVD
U, sigma, Vt = np.linalg.svd(A)
print(f"Singular values: {sigma}")
# Truncated SVD (rank-2 approximation)
k = 2
U_k = U[:, :k]
sigma_k = np.diag(sigma[:k])
Vt_k = Vt[:k, :]
A_approx = U_k @ sigma_k @ Vt_k
print(f"Original matrix:\n{A}")
print(f"Rank-{k} approximation:\n{np.round(A_approx, 2)}")
# Reconstruction error
error = np.linalg.norm(A - A_approx, 'fro')
print(f"Frobenius error: {error:.4f}")
# Explained variance
explained = np.sum(sigma[:k]**2) / np.sum(sigma**2) * 100
print(f"Variance explained with k={k}: {explained:.1f}%")
Broadcasting and Efficient Operations in NumPy
NumPy supports broadcasting, a mechanism that allows operations between arrays of different sizes without creating copies. This is fundamental for writing efficient ML code.
import numpy as np
# Dataset: 1000 samples, 10 features
X = np.random.randn(1000, 10)
# Normalization: subtract mean, divide by std
mean = X.mean(axis=0) # shape: (10,)
std = X.std(axis=0) # shape: (10,)
X_norm = (X - mean) / std # Automatic broadcasting!
# Batch matrix multiplication
# W: (10, 5), X: (1000, 10) -> output: (1000, 5)
W = np.random.randn(10, 5)
output = X_norm @ W # Equivalent to np.dot(X_norm, W)
print(f"Output shape: {output.shape}")
# Element-wise vs matrix multiplication
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(f"Element-wise: {a * b}") # [[5, 12], [21, 32]]
print(f"Matrix mult: {a @ b}") # [[19, 22], [43, 50]]
Practical Application: Neural Network Forward Pass
Let us put everything together by implementing a complete forward pass of a 2-layer neural network using only linear algebra:
import numpy as np
def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
# Architecture: 4 inputs -> 8 hidden -> 3 outputs (classification)
np.random.seed(42)
W1 = np.random.randn(4, 8) * 0.1 # Layer 1 weights
b1 = np.zeros(8) # Layer 1 bias
W2 = np.random.randn(8, 3) * 0.1 # Layer 2 weights
b2 = np.zeros(3) # Layer 2 bias
# Input: batch of 5 samples, 4 features each
X = np.random.randn(5, 4)
# Forward pass (pure linear algebra!)
# Layer 1: linear transformation + activation
z1 = X @ W1 + b1 # (5, 4) @ (4, 8) + (8,) = (5, 8)
h1 = relu(z1) # (5, 8) - ReLU activation
# Layer 2: linear transformation + softmax
z2 = h1 @ W2 + b2 # (5, 8) @ (8, 3) + (3,) = (5, 3)
probs = softmax(z2) # (5, 3) - probabilities for 3 classes
print(f"Output probabilities:\n{np.round(probs, 4)}")
print(f"Row sums: {probs.sum(axis=1)}") # Should be ~1.0
print(f"Predicted classes: {np.argmax(probs, axis=1)}")
Summary and Connections to ML
Key Takeaways
- Dot product \\mathbf{a} \\cdot \\mathbf{b}: measures similarity, the base operation of every neuron
- Matrix multiplication \\mathbf{W}\\mathbf{x}: linear transformation, the heart of forward passes
- Eigenvalues: indicate directions of maximum variance (PCA)
- SVD: universal decomposition for compression, recommendation, denoising
- L2 norm: used for Ridge regularization, prevents overfitting
- Rank: effective data dimensionality, basis for dimensionality reduction
In the Next Article: we will explore differential calculus for deep learning. We will see how gradients allow neural networks to learn and how the chain rule makes backpropagation possible.