Linear Regression: The Foundation of Machine Learning
Linear regression is the simplest and most fundamental algorithm in supervised machine learning. Its goal is to model the linear relationship between one or more independent variables (features) and a continuous dependent variable (target). Despite its simplicity, it is a powerful tool and the foundation on which far more complex algorithms are built.
Mathematically, simple linear regression models the relationship as y = wx + b, where w is the weight (slope) and b is the bias (intercept). For multiple features, it extends to y = w1*x1 + w2*x2 + ... + wn*xn + b. The algorithm searches for the values of w and b that minimize the error between predictions and actual values.
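As a quick sketch (with made-up weights and features), the multi-feature form is just a dot product between a weight vector and a feature vector:

```python
import numpy as np

# Illustrative values only: the model y = w1*x1 + ... + wn*xn + b
# written as a dot product.
w = np.array([2.0, -1.0, 0.5])   # example weights
b = 3.0                          # example bias
x = np.array([1.0, 4.0, 2.0])    # one sample with three features

y_hat = np.dot(w, x) + b         # 2*1 + (-1)*4 + 0.5*2 + 3 = 2.0
print(y_hat)
```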
What You Will Learn in This Article
- How linear regression works and its mathematical formulation
- The Least Squares method
- Logistic regression for classification problems
- Gradient Descent: how algorithms learn
- Complete Python implementation with scikit-learn
- Evaluation metrics for regression and classification
The Cost Function: Measuring Error
To find optimal parameters, we need a function that measures how much our model is wrong: the cost function (or loss function). For linear regression, the standard cost function is the Mean Squared Error (MSE), which calculates the average of the squared differences between predicted and actual values.
MSE penalizes large errors more heavily because of the squaring: an error of 10 contributes 100 times more than an error of 1. This makes the model sensitive to outliers. Alternatives like Mean Absolute Error (MAE) are more robust to outliers, but MAE is not differentiable at zero, which makes it less convenient to optimize.
import numpy as np
# Simple dataset: house area (m2) -> price (thousands euros)
X = np.array([50, 70, 80, 100, 120, 150])
y = np.array([150, 200, 220, 280, 330, 400])
# Calculate parameters with Ordinary Least Squares (OLS)
n = len(X)
x_mean = np.mean(X)
y_mean = np.mean(y)
# w = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
numerator = np.sum((X - x_mean) * (y - y_mean))
denominator = np.sum((X - x_mean) ** 2)
w = numerator / denominator
b = y_mean - w * x_mean
print(f"Weight (w): {w:.2f}") # ~2.5 (each m2 is worth ~2500 euros)
print(f"Bias (b): {b:.2f}") # intercept
# Prediction for a 90 m2 house
price_90 = w * 90 + b
print(f"Estimated price for 90m2: {price_90:.0f}k euros")
# Calculate MSE
predictions = w * X + b
mse = np.mean((y - predictions) ** 2)
print(f"MSE: {mse:.2f}")
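To see the outlier sensitivity mentioned above, here is a toy comparison (with assumed error values) of how a single large error affects MSE versus MAE:

```python
import numpy as np

# Illustrative sketch: three small errors and one outlier.
errors = np.array([1.0, 1.0, 1.0, 10.0])

mse = np.mean(errors ** 2)      # (1 + 1 + 1 + 100) / 4 = 25.75
mae = np.mean(np.abs(errors))   # (1 + 1 + 1 + 10) / 4 = 3.25

print(f"MSE: {mse:.2f}")  # dominated by the single outlier
print(f"MAE: {mae:.2f}")
```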
Gradient Descent: How Algorithms Learn
Gradient Descent is the workhorse optimization algorithm of ML. Instead of solving the equation analytically (as with least squares), gradient descent finds optimal parameters iteratively, repeatedly stepping in the direction that reduces the cost function fastest: the negative gradient.
It works like this: start with random parameters, calculate the gradient (the partial derivative of the cost function with respect to each parameter), and update the parameters in the opposite direction of the gradient. The learning rate controls the step size: too large and the algorithm oscillates without converging, too small and convergence is extremely slow.
import numpy as np
def gradient_descent(X, y, learning_rate=0.00001, epochs=1000):
    """Gradient descent for linear regression.

    Note: with these unscaled X values, a learning rate of 0.0001 makes
    the updates oscillate and diverge; use a smaller step (or scale the
    features first).
    """
    w = 0.0  # initial weight
    b = 0.0  # initial bias
    n = len(X)
    for epoch in range(epochs):
        # Current predictions
        y_pred = w * X + b
        # Gradients (partial derivatives of MSE with respect to w and b)
        dw = (-2/n) * np.sum(X * (y - y_pred))
        db = (-2/n) * np.sum(y - y_pred)
        # Update parameters in the opposite direction of the gradient
        w -= learning_rate * dw
        b -= learning_rate * db
        if epoch % 200 == 0:
            mse = np.mean((y - y_pred) ** 2)
            print(f"Epoch {epoch}: MSE={mse:.2f}, w={w:.4f}, b={b:.4f}")
    return w, b
# Usage
X = np.array([50, 70, 80, 100, 120, 150], dtype=float)
y = np.array([150, 200, 220, 280, 330, 400], dtype=float)
w_opt, b_opt = gradient_descent(X, y)
# Note: without feature scaling the bias converges much more slowly than
# the weight, so these values only approximate the OLS solution above.
print(f"\nOptimal parameters: w={w_opt:.4f}, b={b_opt:.4f}")
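The learning-rate trade-off is easiest to see on a one-dimensional toy problem. This sketch (assumed function and step sizes) minimizes f(w) = (w - 3)², whose gradient is 2(w - 3):

```python
def descend(lr, steps=50):
    """Run gradient descent on f(w) = (w - 3)^2 from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # gradient of (w - 3)^2 is 2*(w - 3)
    return w

print(descend(0.1))   # converges close to the minimum at w = 3
print(descend(1.05))  # step too large: iterates overshoot and diverge
```

With lr = 0.1 each step shrinks the distance to the minimum by a factor of 0.8; with lr = 1.05 the factor is -1.1, so the iterates oscillate with growing amplitude.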
Logistic Regression: From Regression to Classification
Despite its name, logistic regression is a classification algorithm. It transforms the linear regression output into a probability between 0 and 1 using the sigmoid function: σ(z) = 1 / (1 + e^(-z)). If the probability exceeds a threshold (typically 0.5), the sample is classified as positive, otherwise as negative.
The cost function for logistic regression is Binary Cross-Entropy (or Log Loss), which penalizes confident but wrong predictions. Gradient descent is also used here to optimize parameters. Logistic regression is surprisingly effective and is used as a baseline in many classification problems.
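Both ingredients described above are small enough to sketch from scratch (toy probability values assumed): the sigmoid squashes any real number into (0, 1), and binary cross-entropy punishes confident wrong predictions.

```python
import numpy as np

def sigmoid(z):
    """Map any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p):
    """Log loss: heavily penalizes confident but wrong predictions."""
    eps = 1e-12                      # avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(sigmoid(0.0))  # 0.5: the decision threshold sits at z = 0

y_true = np.array([1.0, 0.0])
confident_right = np.array([0.99, 0.01])
confident_wrong = np.array([0.01, 0.99])
print(binary_cross_entropy(y_true, confident_right))  # small loss
print(binary_cross_entropy(y_true, confident_wrong))  # large loss
```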
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.datasets import load_breast_cancer
import numpy as np
# --- CLASSIFICATION with Logistic Regression ---
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Training
log_reg = LogisticRegression(max_iter=10000, random_state=42)
log_reg.fit(X_train, y_train)
# Evaluation
y_pred = log_reg.predict(X_test)
print("=== Logistic Regression (Classification) ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
# Class probabilities
probabilities = log_reg.predict_proba(X_test)[:5]
print(f"\nFirst 5 probabilities:\n{probabilities}")
# --- REGRESSION with Linear Regression ---
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)
lin_reg = LinearRegression()
lin_reg.fit(X_train_h, y_train_h)
y_pred_h = lin_reg.predict(X_test_h)
print("\n=== Linear Regression ===")
print(f"MSE: {mean_squared_error(y_test_h, y_pred_h):.3f}")
print(f"MAE: {mean_absolute_error(y_test_h, y_pred_h):.3f}")
print(f"R2: {r2_score(y_test_h, y_pred_h):.3f}")
Visualizing the Decision Boundary
The decision boundary is the frontier that separates classes in the feature space. For logistic regression with two features, it is a straight line. Visualizing it helps understand how the model makes decisions and where it makes mistakes. With more than two features, the boundary becomes a hyperplane in multidimensional space, but the concept remains the same.
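For the two-feature case, the boundary line can be read directly off the fitted coefficients: it is the set of points where w1*x1 + w2*x2 + b = 0. A minimal sketch, using a synthetic two-class dataset (the `make_classification` setup here is an assumption, not from the examples above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature, two-class dataset for illustration.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
clf = LogisticRegression().fit(X, y)

w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
# Solve w1*x1 + w2*x2 + b = 0 for x2 to get the line's slope and intercept.
slope = -w1 / w2
intercept = -b / w2
print(f"Boundary line: x2 = {slope:.2f} * x1 + {intercept:.2f}")
```

Points on one side of this line get probability above 0.5, points on the other side below.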
Linear vs logistic regression: Linear regression predicts continuous values (prices, temperatures, sales). Logistic regression predicts the probability of belonging to a class. Using them in the wrong context is one of the most common mistakes among ML beginners.
Evaluation Metrics
For regression: MSE (Mean Squared Error), MAE (Mean Absolute Error), RMSE (Root MSE), and R² (coefficient of determination, indicating the proportion of variance explained by the model).
For classification: Accuracy (percentage of correct predictions), Precision (how many predicted positives are correct), Recall (how many actual positives were found), and F1-Score (harmonic mean of precision and recall). We will explore each metric in depth in a dedicated article in this series.
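The regression metrics listed above are simple enough to compute by hand. A sketch on toy values (assumed, for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = np.mean((y_true - y_pred) ** 2)           # 0.125
rmse = np.sqrt(mse)                             # same units as the target
mae = np.mean(np.abs(y_true - y_pred))          # 0.25
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # variance explained: 0.975

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```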
Key Takeaways
- Linear regression finds the relationship y = wx + b by minimizing MSE
- Gradient Descent iteratively optimizes parameters in the direction of the negative gradient
- The learning rate controls convergence speed: too large or too small causes problems
- Logistic regression uses the sigmoid to transform linear regression into classification
- MSE/MAE/R² for regression, Accuracy/Precision/Recall/F1 for classification
- These two algorithms are the foundation upon which more advanced techniques are built