How Decision Trees Work
Decision trees are supervised machine-learning algorithms that make decisions through a series of binary questions organized in a tree structure. Each internal node represents a question about a feature (e.g., "Age > 30?"), each branch represents an answer, and each leaf represents the final prediction. Their strength is interpretability: you can read a decision tree like a flowchart and understand exactly why the model made a certain decision.
The algorithm builds the tree by selecting at each node the feature and threshold that best separate the data into more homogeneous subgroups. This recursive process continues until a stopping criterion is met (maximum depth, minimum samples per leaf, leaf purity).
What You Will Learn in This Article
- How decision trees build splitting rules
- Entropy, Information Gain, and Gini Impurity
- ID3, C4.5, and CART algorithms
- Random Forest: the power of ensemble learning
- Feature Importance and model interpretation
- Pruning techniques to prevent overfitting
Entropy and Information Gain
Entropy measures the disorder (or uncertainty) in a dataset. A dataset with all the same labels has entropy 0 (perfectly pure), while a dataset with equally distributed labels has maximum entropy. Information gain measures how much a feature reduces entropy when used to split the data: the higher the gain, the better the split.
Gini Impurity is an alternative to entropy that is computationally cheaper (no logarithm to evaluate). It measures the probability that a random sample would be incorrectly classified if assigned a random label based on the class distribution in the node. CART (used by scikit-learn) uses Gini by default, while ID3 and C4.5 use entropy.
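Both measures are easy to compute by hand. The sketch below (illustrative, using NumPy) implements entropy, Gini impurity, and the information gain of a binary split:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity: chance of mislabeling a random sample
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Entropy reduction achieved by splitting parent into left/right
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # balanced -> maximum entropy
print(entropy(labels))                                    # 1.0
print(gini(labels))                                       # 0.5
print(information_gain(labels, labels[:4], labels[4:]))   # 1.0 (perfect split)
```

A perfect split drives both children to entropy 0, so the gain equals the parent's entropy; a useless split leaves the class mix unchanged and yields a gain of 0.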
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Tree with Gini (default)
tree_gini = DecisionTreeClassifier(
    criterion='gini',
    max_depth=3,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
tree_gini.fit(X_train, y_train)
# Tree with Entropy
tree_entropy = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=3,
    random_state=42
)
tree_entropy.fit(X_train, y_train)
# Evaluation
for name, tree in [("Gini", tree_gini), ("Entropy", tree_entropy)]:
    y_pred = tree.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} - Accuracy: {acc:.3f}")
# Text visualization of the tree
print("\nTree structure (Gini):")
print(export_text(tree_gini, feature_names=list(feature_names)))
Overfitting in Trees and Pruning
Decision trees are particularly prone to overfitting: without constraints, a tree can grow until it has one leaf per training sample, memorizing the data instead of learning generalizable patterns. Pruning is the set of techniques to limit tree complexity.
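This is easy to demonstrate. The sketch below (synthetic noisy data; all parameters are illustrative) compares an unconstrained tree, which memorizes the training set, with a depth-limited one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic dataset: flip_y injects label noise
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Constrained tree: pre-pruned with a depth limit
pruned = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

print(f"Full   - train: {full.score(X_train, y_train):.2f}, "
      f"test: {full.score(X_test, y_test):.2f}")
print(f"Pruned - train: {pruned.score(X_train, y_train):.2f}, "
      f"test: {pruned.score(X_test, y_test):.2f}")
```

The unconstrained tree reaches perfect training accuracy (it fits the label noise too), while the depth-limited tree typically closes the gap between training and test performance.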
Pre-pruning (early stopping) limits the tree during construction via constraints such as maximum depth (max_depth), minimum samples per split (min_samples_split), and minimum samples per leaf (min_samples_leaf). Post-pruning builds the full tree and then removes branches that do not improve performance on a validation set. Scikit-learn offers the ccp_alpha parameter (cost-complexity pruning) for post-pruning.
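As a sketch of post-pruning in practice (the dataset choice is illustrative), cost_complexity_pruning_path enumerates the candidate ccp_alpha values, each yielding a progressively smaller tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alpha values from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

# Refit a pruned tree for a sample of alphas: larger alpha -> fewer leaves
for alpha in path.ccp_alphas[::10]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_test, y_test):.3f}")
```

In a real workflow you would pick ccp_alpha by cross-validation rather than eyeballing the printout.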
Random Forest: The Ensemble of Trees
Random Forest is an ensemble algorithm that combines hundreds of decision trees to achieve more accurate and stable predictions. Each tree is trained on a random subset of data (bootstrap sampling) and uses a random subset of features at each split. The final prediction is the majority vote (classification) or the average (regression) of all trees.
This double randomization reduces variance without increasing bias: each individual tree may be imprecise, but the ensemble is much more robust. The Out-of-Bag (OOB) error allows estimating generalization error without a separate validation set, leveraging samples not selected in each tree's bootstrap.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Random Forest
rf = RandomForestClassifier(
    n_estimators=200,      # 200 trees
    max_depth=10,          # maximum depth
    min_samples_split=5,
    oob_score=True,        # OOB error estimation
    random_state=42,
    n_jobs=-1              # use all cores
)
rf.fit(X_train, y_train)
# Performance
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"OOB Score: {rf.oob_score_:.3f}")
# Feature Importance - which features matter most?
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
print("\nTop 10 Feature Importance:")
for i in range(10):
    idx = sorted_idx[i]
    print(f"  {feature_names[idx]:<30s} {importances[idx]:.4f}")
Feature Importance: Interpreting the Model
One of the most valuable characteristics of trees and Random Forests is feature importance: the ability to quantify how much each feature contributes to predictions. In scikit-learn, importance is calculated as the mean impurity decrease (Gini or entropy) contributed by each feature across all trees in the forest.
This is fundamental for domain understanding, feature selection, and communicating results. If a feature has importance close to zero, it can probably be removed without performance loss, simplifying the model and reducing overfitting risk.
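One way to operationalize this is scikit-learn's SelectFromModel, sketched below; keeping only features above the mean importance is an illustrative threshold choice, not a rule:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Keep only features whose importance exceeds the mean importance
selector = SelectFromModel(rf, threshold="mean", prefit=True)
X_train_sel = selector.transform(X_train)

# Retrain on the reduced feature set
rf_sel = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_sel.fit(X_train_sel, y_train)

print(f"Features: {X_train.shape[1]} -> {X_train_sel.shape[1]}")
print(f"Accuracy on selected features: "
      f"{rf_sel.score(selector.transform(X_test), y_test):.3f}")
```

If accuracy is essentially unchanged after the cut, the dropped features were indeed carrying little signal.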
Single tree vs Random Forest: A single tree is interpretable but unstable (small changes in data produce very different trees). Random Forest sacrifices direct interpretability to achieve much more stable and accurate predictions. In practice, Random Forest is almost always preferable, and feature importance compensates for the loss of interpretability.
When to Use Decision Trees
Decision trees and Random Forests excel when interpretable models are needed, when features have different scales (no normalization required), when there are non-linear relationships in the data, and when you want a quick baseline to implement. They are less suitable for datasets with many sparse features (like text mining) or when the relationship between features and target is strongly linear (where linear regression is more efficient).
Key Takeaways
- Decision trees use recursive splits based on entropy or Gini impurity
- Information Gain measures split quality: higher is better
- Pruning (pre and post) is essential to prevent overfitting
- Random Forest combines hundreds of trees with bootstrap and randomized features
- OOB error estimates generalization without a separate validation set
- Feature importance reveals which variables drive model predictions