The Power of Ensemble Methods
Ensemble methods combine multiple weak models into a strong one. The intuition is simple: asking many mediocre experts for their opinions and aggregating the answers often produces better results than relying on any single expert. Ensembles have powered many winning solutions in ML competitions on Kaggle and run in countless production systems. The three pillars of ensembling are bagging, boosting, and stacking.
Bagging (Bootstrap Aggregating) trains independent models on random data subsets and aggregates their predictions. It reduces variance. Boosting trains models sequentially, where each new model corrects the errors of the previous one. It reduces bias. Stacking uses predictions from base models as input for a meta-model that learns to combine them optimally.
What You Will Learn in This Article
- Bagging and Random Forest: reducing variance
- AdaBoost: the first boosting algorithm
- Gradient Boosting: the state of the art
- XGBoost, LightGBM, and CatBoost: modern implementations
- Stacking and Voting: combining heterogeneous models
- Hyperparameter tuning for ensembles
Bagging and Random Forest
Bagging creates N copies of the dataset through bootstrap sampling (sampling with replacement), trains a model on each copy, and aggregates predictions through majority voting (classification) or averaging (regression). Random Forest extends bagging by adding feature randomization: at each split, only a random subset of features is considered. This double randomization produces decorrelated trees, further reducing variance.
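The two mechanical ingredients of bagging, sampling with replacement and majority voting, can be sketched in a few lines of NumPy. This is an illustrative toy (the array names and sizes are made up for the demo), not part of any library API:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # a toy "dataset" of 10 sample indices

# One bootstrap copy: draw n indices with replacement
boot = rng.choice(data, size=len(data), replace=True)
print("bootstrap sample:", boot)

# On average only ~63.2% of distinct samples appear in each bootstrap
# copy; the rest are "out-of-bag" and can serve as a validation set.
print("unique in-bag fraction:", len(np.unique(boot)) / len(data))

# Majority voting over 5 hypothetical classifiers' predictions
preds = np.array([1, 0, 1, 1, 0])
majority = np.bincount(preds).argmax()
print("majority vote:", majority)  # 1
```

The out-of-bag samples are what Random Forest uses for its built-in `oob_score_` estimate.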
AdaBoost: Adaptive Boosting
AdaBoost (Adaptive Boosting) was the first successful boosting algorithm. It trains models sequentially, assigning higher weights to samples misclassified by the previous model. Each new model focuses on the hardest errors. Models are combined with a weighted average, where each model's weight depends on its accuracy.
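The reweighting step can be made concrete with the classic binary AdaBoost formulas: a model with weighted error err gets weight alpha = 0.5 * ln((1 - err) / err), and sample weights are multiplied by exp(-alpha * y * h(x)) then renormalized. The toy labels and predictions below are invented for illustration:

```python
import numpy as np

# Toy binary problem: labels in {-1, +1}; the weak learner
# misclassifies only the sample at index 1.
y = np.array([1, 1, -1, -1, 1])
h = np.array([1, -1, -1, -1, 1])

w = np.full(len(y), 1 / len(y))        # uniform initial weights
err = w[h != y].sum()                  # weighted error = 0.2
alpha = 0.5 * np.log((1 - err) / err)  # model weight ~ 0.693

# Misclassified samples are up-weighted, correct ones down-weighted
w = w * np.exp(-alpha * y * h)
w /= w.sum()
print("alpha:", round(alpha, 3))
print("new weights:", np.round(w, 3))  # the mistake now carries weight 0.5
```

After one round, half of the total weight sits on the single misclassified sample, which is exactly what forces the next weak learner to focus on it.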
from sklearn.ensemble import (
BaggingClassifier, RandomForestClassifier,
AdaBoostClassifier, GradientBoostingClassifier,
VotingClassifier, StackingClassifier
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
data = load_breast_cancer()
X, y = data.data, data.target
# Ensemble models
models = {
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Bagging (50 trees)': BaggingClassifier(
estimator=DecisionTreeClassifier(random_state=42),
n_estimators=50, random_state=42, n_jobs=-1
),
'Random Forest': RandomForestClassifier(
n_estimators=100, random_state=42, n_jobs=-1
),
'AdaBoost': AdaBoostClassifier(
n_estimators=100, learning_rate=0.1, random_state=42
),
'Gradient Boosting': GradientBoostingClassifier(
n_estimators=100, learning_rate=0.1,
max_depth=3, random_state=42
)
}
print("Ensemble Methods Comparison:")
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f" {name:<25s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
Gradient Boosting: The King of Competitions
Gradient Boosting builds models sequentially, where each new model is trained on the residuals (errors) of the previous model. It uses gradient descent to minimize an arbitrary loss function. The result is an extremely powerful model that requires careful hyperparameter tuning to avoid overfitting.
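For squared loss the negative gradient is simply the residual, so the whole algorithm can be sketched from scratch: start from the mean, repeatedly fit a small tree to the current residuals, and add its (shrunken) predictions. This is a minimal illustration on synthetic data, not how production libraries implement it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # initial model: predict the mean
trees = []
for _ in range(100):
    residuals = y - pred                     # negative gradient of MSE
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                   # each tree fits the residuals
    pred += learning_rate * tree.predict(X)  # shrunken additive update
    trees.append(tree)

print("final training MSE:", round(float(np.mean((y - pred) ** 2)), 4))
```

The learning rate shrinks each tree's contribution; lowering it while raising the number of trees is the standard lever for trading training time against generalization.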
XGBoost (eXtreme Gradient Boosting) is the most popular implementation: it adds L1/L2 regularization, native missing value handling, parallelization, and intelligent pruning. LightGBM by Microsoft is optimized for speed and memory on large datasets (grows trees leaf-wise instead of level-wise). CatBoost by Yandex natively handles categorical features without manual encoding.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.datasets import load_breast_cancer
import numpy as np
data = load_breast_cancer()
X, y = data.data, data.target
# scikit-learn Gradient Boosting with Grid Search
param_grid = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.05, 0.1],
'max_depth': [3, 5, 7],
'min_samples_split': [2, 5],
'subsample': [0.8, 1.0]
}
gb = GradientBoostingClassifier(random_state=42)
# Note: GridSearchCV with many parameters is slow
# In practice use RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
gb, param_grid,
n_iter=20, # 20 random combinations
cv=5,
scoring='accuracy',
random_state=42,
n_jobs=-1
)
random_search.fit(X, y)
print(f"Best score: {random_search.best_score_:.4f}")
print(f"Best params: {random_search.best_params_}")
# XGBoost (if installed)
try:
from xgboost import XGBClassifier
xgb = XGBClassifier(
n_estimators=200,
learning_rate=0.05,
max_depth=5,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1, # L1 regularization
reg_lambda=1.0, # L2 regularization
random_state=42,
eval_metric='logloss'
)
scores_xgb = cross_val_score(xgb, X, y, cv=5, scoring='accuracy')
print(f"\nXGBoost: {scores_xgb.mean():.4f} (+/- {scores_xgb.std():.4f})")
except ImportError:
print("\nXGBoost not installed. Install with: pip install xgboost")
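LightGBM can be tried the same way, with the same optional-import guard. The parameter values below are illustrative starting points, not tuned settings; note num_leaves, which caps LightGBM's leaf-wise tree growth:

```python
try:
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import load_breast_cancer

    data = load_breast_cancer()
    X, y = data.data, data.target

    lgbm = LGBMClassifier(
        n_estimators=200,
        learning_rate=0.05,
        num_leaves=31,       # limits leaf-wise growth (LightGBM's key knob)
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        verbose=-1           # silence per-iteration logging
    )
    scores_lgbm = cross_val_score(lgbm, X, y, cv=5, scoring='accuracy')
    print(f"LightGBM: {scores_lgbm.mean():.4f} (+/- {scores_lgbm.std():.4f})")
except ImportError:
    print("LightGBM not installed. Install with: pip install lightgbm")
```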
Stacking and Voting
Voting Classifier combines predictions from different models. Hard voting uses majority vote. Soft voting averages predicted probabilities (generally more effective). Stacking goes further: it uses base model predictions as input features for a meta-learner that learns the optimal combination.
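A tiny numeric example shows why soft voting is often stronger: averaging probabilities keeps confidence information that a hard majority vote throws away. The probabilities below are made up for illustration:

```python
import numpy as np

# P(class=1) for one sample, as reported by three classifiers
probs = np.array([0.45, 0.45, 0.90])

hard_votes = (probs > 0.5).astype(int)   # [0, 0, 1]
hard = np.bincount(hard_votes).argmax()  # majority says class 0
soft = int(probs.mean() > 0.5)           # mean = 0.60 -> class 1

print("hard voting:", hard)  # 0
print("soft voting:", soft)  # 1: the confident model tips the average
```

Two lukewarm "no" votes outvote one confident "yes" under hard voting, while soft voting lets the confident model win, which is why voting='soft' is generally preferred when all base models expose calibrated probabilities.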
from sklearn.ensemble import (
VotingClassifier, StackingClassifier,
RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
# Base models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
# Soft Voting
voting = VotingClassifier(
estimators=[('rf', rf), ('gb', gb), ('svm', svm), ('lr', lr)],
voting='soft'
)
# Stacking with LogisticRegression meta-learner
stacking = StackingClassifier(
estimators=[('rf', rf), ('gb', gb), ('svm', svm)],
final_estimator=LogisticRegression(max_iter=10000),
cv=5
)
# Comparison
for name, model in [('RF', rf), ('GB', gb), ('Voting', voting), ('Stacking', stacking)]:
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"{name:<12s}: {scores.mean():.4f} (+/- {scores.std():.4f})")
Bagging vs Boosting: Bagging reduces variance by combining independent models (Random Forest). Boosting reduces bias by sequentially correcting errors (XGBoost). Bagging is parallelizable and more resistant to overfitting. Boosting is generally more powerful but requires more careful tuning. In practice, Gradient Boosting (XGBoost, LightGBM) is often the best choice for tabular data.
Key Takeaways
- Bagging reduces variance by combining independent models (Random Forest)
- Boosting reduces bias by sequentially correcting errors (AdaBoost, Gradient Boosting)
- XGBoost/LightGBM are the state of the art for tabular data
- Low learning rate + many estimators = better results but slower training
- Stacking combines heterogeneous models with a meta-learner
- RandomizedSearchCV is more efficient than GridSearchCV with many hyperparameters