K-Nearest Neighbors: The Lazy Classifier
K-Nearest Neighbors (KNN) is one of the most intuitive machine learning algorithms: to classify a new point, it looks at the K nearest points in the training set and assigns the majority class among them. It builds no explicit model during training (that is why it is called a lazy learner): all the computational work happens at prediction time, when the algorithm must search the feature space for neighbors.
KNN's simplicity is both its strength and its weakness: it is easy to understand and implement, but it becomes slow on large datasets because each prediction requires computing the distance to every training point. Data structures such as KD-trees and ball trees mitigate this cost but do not eliminate it entirely.
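The prediction step described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea (distance computation plus majority vote), not a replacement for scikit-learn's optimized KNeighborsClassifier:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny example: two classes in well-separated regions
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.2]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))  # -> 1
```

Note that every prediction scans the whole training set: this is the "lazy" cost that KD-trees and ball trees are designed to reduce.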
What You Will Learn in This Article
- How KNN works and distance metrics
- How to choose the optimal value of K
- K-Means: the most popular clustering algorithm
- DBSCAN: density-based clustering
- How to evaluate clustering without labels
- Silhouette Score and Elbow method
Distance Metrics
KNN relies on the concept of distance between points in the feature space. The choice of metric significantly influences results. Euclidean distance is the most common: the straight line between two points in n-dimensional space. Manhattan distance calculates the sum of absolute differences (like walking in a street grid). Minkowski distance is a generalization of both, controlled by the parameter p.
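The three metrics can be compared directly on a pair of points; a quick sketch using scipy.spatial.distance, which also shows that Minkowski with p=1 reduces to Manhattan and p=2 to Euclidean:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])  # differences: (3, 4, 0)

print(euclidean(a, b))       # sqrt(3^2 + 4^2 + 0^2) = 5.0
print(cityblock(a, b))       # |3| + |4| + |0| = 7.0 (Manhattan)
print(minkowski(a, b, p=1))  # 7.0, same as Manhattan
print(minkowski(a, b, p=2))  # 5.0, same as Euclidean
```

In scikit-learn, KNeighborsClassifier exposes the same choice through its metric and p parameters (the default, minkowski with p=2, is plain Euclidean distance).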
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
# Dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Find optimal K
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
best_k = k_range[np.argmax(cv_scores)]
print(f"Best K: {best_k} with accuracy: {max(cv_scores):.3f}")
# Final model with optimal K
final_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=best_k, weights='distance'))
])
final_pipeline.fit(X_train, y_train)
test_accuracy = final_pipeline.score(X_test, y_test)
print(f"Test accuracy with K={best_k}: {test_accuracy:.3f}")
K-Means Clustering
K-Means is the most widely used clustering algorithm. It partitions data into K clusters, where each cluster is defined by its centroid (the mean point). The algorithm alternates between two phases: assigning each point to the nearest centroid, and recalculating centroids as the mean of assigned points. It repeats until convergence.
K-Means works well with spherical clusters of similar sizes but has important limitations: it requires K to be specified in advance, is sensitive to centroid initialization (partially solved by K-Means++), and does not handle irregularly shaped or differently sized clusters well.
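The assign/update loop described above can be written explicitly. Here is a minimal sketch of Lloyd's algorithm in NumPy, using random training points as initial centroids (scikit-learn's KMeans adds K-Means++ initialization, multiple restarts, and empty-cluster handling, which this sketch omits):

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random training points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (note: a cluster left empty would produce NaN here; sklearn handles this)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # convergence: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two obvious groups; the loop should separate them
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans_lloyd(X, k=2)
print(labels, centroids)
```

The cluster numbering depends on the random initialization, but the partition itself is the same: the two left points end up together, as do the two right points.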
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np
# Example data: customer segmentation
np.random.seed(42)
# 3 natural customer groups
group1 = np.random.randn(100, 2) * 0.5 + [2, 2] # Budget
group2 = np.random.randn(100, 2) * 0.8 + [7, 7] # Premium
group3 = np.random.randn(100, 2) * 0.6 + [2, 8] # Frequent
X = np.vstack([group1, group2, group3])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Elbow method: find optimal K
inertias = []
silhouettes = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))
# Results
for k, inertia, sil in zip(K_range, inertias, silhouettes):
    marker = " <-- optimal" if k == 3 else ""
    print(f"K={k}: Inertia={inertia:.1f}, Silhouette={sil:.3f}{marker}")
# Final model
kmeans_final = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans_final.fit_predict(X_scaled)
print(f"\nCluster distribution: {np.bincount(clusters)}")
DBSCAN: Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that does not require specifying the number of clusters in advance. It identifies clusters as dense regions separated by low-density regions. Two parameters control behavior: eps (the neighborhood radius) and min_samples (the minimum number of points to form a cluster).
DBSCAN classifies points into three categories: core points (with at least min_samples points within distance eps), border points (within eps of a core point but with fewer than min_samples neighbors of their own), and noise points (neither core nor border). Unlike K-Means, DBSCAN handles arbitrarily shaped clusters and flags outliers automatically.
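The three categories can be recovered from a fitted model: scikit-learn's DBSCAN exposes core points through its core_sample_indices_ attribute, noise points carry the label -1, and border points are whatever remains. A small sketch on synthetic data (one dense blob plus two isolated outliers):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# 50 points in a dense blob, plus two isolated outliers
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               [[5.0, 5.0], [-5.0, 5.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = db.labels_ == -1               # noise points
border_mask = ~core_mask & ~noise_mask      # in a cluster, but not core

print(f"core: {core_mask.sum()}, border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```

The two isolated points have no neighbors within eps, so they cannot be core and are not reachable from any core point: DBSCAN labels them noise without being told they are outliers.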
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_moons
import numpy as np
# Dataset with non-spherical clusters (moons)
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=10)
labels_dbscan = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)
n_noise = list(labels_dbscan).count(-1)
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
if n_clusters > 1:
    mask = labels_dbscan != -1
    sil = silhouette_score(X_scaled[mask], labels_dbscan[mask])
    print(f" Silhouette (excluding noise): {sil:.3f}")
# Hierarchical Clustering (Agglomerative)
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels_agg = agg.fit_predict(X_scaled)
sil_agg = silhouette_score(X_scaled, labels_agg)
print(f"\nHierarchical (Ward): Silhouette = {sil_agg:.3f}")
# K-Means for comparison (struggles with moons)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km = kmeans.fit_predict(X_scaled)
sil_km = silhouette_score(X_scaled, labels_km)
print(f"K-Means: Silhouette = {sil_km:.3f}")
Clustering Evaluation Metrics
Evaluating clustering is harder than evaluating supervised classification because there are no reference labels. The Silhouette Score measures how similar each point is to its own cluster compared to the nearest neighboring cluster: it ranges from -1 (wrong assignment) to 1 (dense, well-separated clusters). The Davies-Bouldin Index measures the average similarity between each cluster and its most similar one: lower is better. The Calinski-Harabasz Index is the ratio of between-cluster to within-cluster dispersion: higher is better.
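All three metrics are available in scikit-learn and take only the data and the predicted labels. A quick sketch comparing them on the same clustering of well-separated synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")         # higher is better, in [-1, 1]
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")     # lower is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better
```

Because the metrics point in different directions (silhouette and Calinski-Harabasz up, Davies-Bouldin down), it helps to check that they agree when comparing candidate values of K.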
KNN vs K-Means: Do not confuse them! KNN is a supervised algorithm for classification/regression (uses labels). K-Means is an unsupervised clustering algorithm (does not use labels). The only thing they share is the letter K.
When to Use Each Algorithm
- KNN: classification problems on small-to-medium datasets, when the relationship between features and target is local and non-linear
- K-Means: segmentation with roughly spherical, balanced clusters
- DBSCAN: clusters with irregular shapes, significant outliers, or an unknown number of clusters
- Hierarchical: when you need to understand the grouping hierarchy and the dataset is not too large
Key Takeaways
- KNN classifies based on the K nearest neighbors: simple but slow on large datasets
- K selection is done with cross-validation: too low causes overfitting, too high underfitting
- K-Means partitions into K clusters with centroids; requires known K and works with spherical clusters
- DBSCAN finds arbitrarily shaped clusters based on density and identifies outliers
- Silhouette Score is the most used metric for evaluating clustering quality
- Feature scaling is essential for all distance-based algorithms