What Are Support Vector Machines?
Support Vector Machines (SVMs) are supervised learning algorithms for classification and regression. The central idea is to find the hyperplane that separates the classes with the maximum possible margin. The points closest to the hyperplane, called support vectors, are the ones that define the decision boundary: every other training point is irrelevant to the hyperplane's position.
This property makes SVMs particularly robust: the model depends only on a small subset of training data. Furthermore, thanks to the kernel trick, SVMs can handle non-linear separations by projecting data into high-dimensional spaces where they become linearly separable.
What You Will Learn in This Article
- The concept of hyperplane and maximum margin
- Hard margin vs soft margin: error tolerance
- The kernel trick for non-linear separations
- Hyperparameters C and gamma: how to tune them
- Multiclass SVM with one-vs-rest
- Practical implementation with scikit-learn
Maximum Margin and Support Vectors
In a binary classification problem, there are infinitely many hyperplanes that can separate two classes. The SVM chooses the one with the maximum margin: the distance between the hyperplane and the closest points of each class. Maximizing the margin improves the model's generalization on unseen data. The points that lie exactly on the margin are the support vectors.
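The margin width can be read directly off a fitted linear model: writing the hyperplane as w·x + b = 0, the distance between the two margin lines is 2/||w||. A minimal sketch on synthetic blob data (the dataset and the very high C, used to approximate a hard margin, are chosen here purely for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Toy, well-separated 2D data (illustrative choice)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=42)

clf = SVC(kernel='linear', C=1e6)  # very high C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                      # normal vector of the hyperplane
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin lines
print(f"Margin width: {margin_width:.3f}")
print(f"Support vectors: {len(clf.support_vectors_)}")
```

Only a handful of points end up as support vectors; moving any of the others would leave the boundary unchanged.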
Hard margin SVM requires that no point falls inside the margin or on the wrong side. This only works with perfectly separable data. Soft margin SVM introduces the parameter C that controls the tradeoff between maximizing the margin and minimizing violations: a high C severely penalizes violations (narrow margin), a low C tolerates more errors (wide margin, more regularization).
```python
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

# Dataset
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline with scaling (ESSENTIAL for SVM!)
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='linear', C=1.0, random_state=42))
])
svm_pipeline.fit(X_train, y_train)

y_pred = svm_pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\n{classification_report(y_test, y_pred, target_names=data.target_names)}")

# Number of support vectors per class
svm_model = svm_pipeline.named_steps['svm']
print(f"Support vectors per class: {svm_model.n_support_}")
print(f"Total support vectors: {sum(svm_model.n_support_)} out of {len(X_train)} samples")
```
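To see the soft-margin tradeoff described above in practice, here is a sketch (the specific C values are an illustrative choice) comparing support-vector counts on the same dataset: a lower C widens the margin, so more training points fall on or inside it and become support vectors.

```python
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Lower C -> wider margin -> more points on or inside it -> more support vectors
sv_counts = {}
for C in [0.01, 1.0, 100.0]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='linear', C=C, random_state=42))
    ])
    pipe.fit(X_train, y_train)
    sv_counts[C] = int(pipe.named_steps['svm'].n_support_.sum())
    print(f"C={C:>6}: {sv_counts[C]:3d} support vectors, "
          f"test accuracy {pipe.score(X_test, y_test):.3f}")
```

The support-vector count shrinks as C grows, while test accuracy typically changes much less: this is the regularization knob in action.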
The Kernel Trick: Non-Linear Separations
When data is not linearly separable, the kernel trick projects data into a higher-dimensional space where it becomes separable. The trick is that this projection happens implicitly through a kernel function, without ever explicitly computing coordinates in the high-dimensional space.
The most common kernels are: RBF (Radial Basis Function), the most versatile, which measures similarity between points as a Gaussian; Polynomial, which captures feature interactions up to a certain degree; Sigmoid, similar to a single-layer neural network. The kernel choice depends on the geometry of the data.
```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_moons

# Non-linearly separable dataset
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)

# Kernel comparison: each kernel gets its own hyperparameter grid
kernels = {
    'linear': {'svm__C': [0.1, 1, 10]},
    'rbf': {'svm__C': [0.1, 1, 10], 'svm__gamma': ['scale', 'auto', 0.1, 1]},
    'poly': {'svm__C': [0.1, 1, 10], 'svm__degree': [2, 3, 4]}
}

for kernel, params in kernels.items():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel=kernel, random_state=42))
    ])
    grid = GridSearchCV(pipeline, params, cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X, y)
    print(f"Kernel {kernel:<8s} - Best accuracy: {grid.best_score_:.3f}")
    print(f"  Best params: {grid.best_params_}")
```
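As a sanity check on the RBF kernel described above, which scores similarity as a Gaussian of the squared distance, K(x, z) = exp(-gamma·||x − z||²), the value can be computed by hand and compared with scikit-learn's `rbf_kernel` (the two points and gamma below are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary sample points
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 2.0]])
gamma = 0.5

# RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
manual = np.exp(-gamma * np.sum((x - z) ** 2))
library = rbf_kernel(x, z, gamma=gamma)[0, 0]
print(manual, library)  # both exp(-2.5), approx. 0.0821
```

Identical points give K = 1; the similarity decays toward 0 as the distance grows, at a rate set by gamma.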
Hyperparameter Tuning
SVMs have two crucial hyperparameters: C (regularization) and gamma (for RBF and poly kernels). C controls the margin/error tradeoff: high C values try to correctly classify every training point (overfitting risk), low values tolerate more errors for a wider margin (underfitting risk).
Gamma controls the influence radius of each support vector: high gamma makes the model sensitive to individual points (overfitting), low gamma makes the decision boundary smoother (underfitting). The optimal combination of C and gamma is found with Grid Search or Random Search with cross-validation.
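The over/underfitting behavior of gamma is easy to observe by comparing train and test accuracy. A sketch on the moons dataset (the gamma values and noise level are illustrative choices):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# High gamma: tiny influence radius per support vector -> memorization
# Low gamma: very smooth boundary -> underfitting
scores = {}
for gamma in [0.01, 1, 100]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=1.0, gamma=gamma))
    ])
    pipe.fit(X_train, y_train)
    scores[gamma] = (pipe.score(X_train, y_train), pipe.score(X_test, y_test))
    print(f"gamma={gamma:>6}: train {scores[gamma][0]:.3f}, "
          f"test {scores[gamma][1]:.3f}")
```

The telltale sign of overfitting at high gamma is the widening gap between train and test accuracy.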
Mandatory scaling: SVMs are sensitive to feature scale. If one feature ranges from 0 to 1 and another from 0 to 1000, the second will dominate distance calculations. Always use StandardScaler or MinMaxScaler before SVM. This is the most common mistake with SVMs.
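The effect of scaling can be measured directly on a dataset whose features span very different ranges, such as breast_cancer. A sketch (exact accuracies depend on the scikit-learn version, but scaling should not hurt):

```python
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# breast_cancer features range from ~0.1 to ~2500 across columns
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same model, with and without standardization
unscaled = SVC(kernel='rbf').fit(X_train, y_train).score(X_test, y_test)
scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
]).fit(X_train, y_train).score(X_test, y_test)

print(f"RBF SVM without scaling: {unscaled:.3f}")
print(f"RBF SVM with scaling:    {scaled:.3f}")
```

Note that the default gamma='scale' partially compensates for feature magnitudes, so the gap is smaller than it once was, but standardization remains the safe default.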
Multiclass SVM and Regression
SVMs are natively binary, but scikit-learn handles multiclass problems automatically. SVC uses the one-vs-one strategy internally, training N(N-1)/2 pairwise classifiers for N classes, while LinearSVC uses one-vs-rest: N classifiers are trained, each distinguishing one class from all the others, and the predicted class is the one with the highest score.
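A sketch on the three-class iris dataset showing that both estimators handle multiclass transparently (the dataset choice is illustrative):

```python
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# SVC trains one-vs-one pairwise classifiers under the hood
ovo = Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf'))])
ovo.fit(X_train, y_train)

# LinearSVC trains one classifier per class (one-vs-rest)
ovr = Pipeline([('scaler', StandardScaler()), ('svm', LinearSVC())])
ovr.fit(X_train, y_train)

print(f"SVC (one-vs-one) accuracy:       {ovo.score(X_test, y_test):.3f}")
print(f"LinearSVC (one-vs-rest) accuracy: {ovr.score(X_test, y_test):.3f}")
# One weight vector per class in the one-vs-rest model
print(f"LinearSVC coef_ shape: {ovr.named_steps['svm'].coef_.shape}")
```

The coef_ shape of (3, 4), one row of weights per class, makes the one-vs-rest decomposition visible.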
For regression, SVR (Support Vector Regression) uses the same principle but looks for a tube (epsilon-tube) that contains as many points as possible. Points outside the tube become support vectors. Here too, the kernel trick enables non-linear regressions.
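A minimal SVR sketch on a noisy sine curve (toy data; the C and epsilon values are illustrative). Only points falling outside the epsilon-tube become support vectors, so the model is typically sparse:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Noisy sine curve as a toy regression problem
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# epsilon defines the tube width: points inside it carry no penalty
svr = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=10, epsilon=0.1))
])
svr.fit(X, y)

n_sv = len(svr.named_steps['svr'].support_)
print(f"Support vectors: {n_sv} out of {len(X)} points")
print(f"R^2 on training data: {svr.score(X, y):.3f}")
```

Widening epsilon prunes support vectors at the cost of a coarser fit; narrowing it does the opposite.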
When to Use SVMs
SVMs excel with small to medium datasets (up to tens of thousands of samples), in high-dimensional spaces, and when classes are separated by clear margins. They are less suitable for very large datasets (training complexity grows between quadratically and cubically with the number of samples), for very noisy data, and when calibrated probabilities are needed: a native SVM does not produce them, so additional calibration is required.
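On the probability point, scikit-learn offers two standard workarounds; a sketch of both (dataset choice is illustrative):

```python
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Option 1: probability=True enables internal Platt scaling
# (fit via cross-validation, so training is noticeably slower)
svc_prob = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', probability=True, random_state=42))
])
svc_prob.fit(X_train, y_train)
proba = svc_prob.predict_proba(X_test)
print(f"Probabilities for first test sample: {proba[0]}")

# Option 2: wrap an uncalibrated SVM in CalibratedClassifierCV
base = Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf'))])
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=3)
calibrated.fit(X_train, y_train)
print(f"Calibrated probabilities: {calibrated.predict_proba(X_test)[0]}")
```

Both approaches fit a sigmoid on top of the SVM's decision scores; CalibratedClassifierCV also supports isotonic regression via method='isotonic'.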
Key Takeaways
- SVMs find the hyperplane with maximum margin between classes
- Support vectors are the points that define the decision boundary
- The kernel trick handles non-linear separations without explicit high-dimensional calculations
- C controls the margin/error tradeoff, gamma controls support vector influence radius
- Feature scaling is mandatory before using SVMs
- Grid Search with cross-validation is the standard way to find the best hyperparameters