The Importance of Data Preprocessing
Feature engineering and data preprocessing are the most critical phases of any machine learning project. An unwritten rule says that 80% of a data scientist's time is spent preparing data and only 20% on modeling. No matter how sophisticated the algorithm, if the input data is dirty, incomplete, or poorly represented, the model will produce poor results: garbage in, garbage out.
Preprocessing transforms raw data into a format suitable for the algorithm. Feature engineering goes further: it creates new variables from existing ones, leveraging domain knowledge to capture relationships that the algorithm alone would not find. Together, these phases determine the success or failure of an ML project.
What You Will Learn in This Article
- Techniques for handling missing values
- Encoding categorical variables
- Scaling and normalizing numerical features
- Outlier detection and handling
- Creating new features with domain knowledge
- Preprocessing pipelines with scikit-learn
Handling Missing Values
Real-world data almost always contains missing values (NaN, null). There are three main strategies: deletion (removing rows or columns with too many missing values), statistical imputation (replacing missing entries with the mean, median, or mode), and predictive imputation (using a model to predict the missing values). The choice depends on how much data is missing and on the missingness pattern (random or systematic).
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# Dataset with missing values
data = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 35, np.nan, 28, 50],
    'income': [30000, np.nan, 45000, 60000, np.nan, 55000, 32000, 70000],
    'category': ['A', 'B', 'A', np.nan, 'B', 'A', 'B', 'A'],
    'target': [0, 1, 0, 1, 1, 0, 1, 0]
})
print("Missing values per column:")
print(data.isnull().sum())
print(f"\nMissing percentage:\n{(data.isnull().mean() * 100).round(1)}")
# Strategy 1: Mean/median imputation
imputer_mean = SimpleImputer(strategy='mean')
data['age_imputed'] = imputer_mean.fit_transform(data[['age']])
imputer_median = SimpleImputer(strategy='median')
data['income_imputed'] = imputer_median.fit_transform(data[['income']])
# Strategy 2: KNN imputation (uses neighbors)
knn_imputer = KNNImputer(n_neighbors=3)
numeric_cols = data[['age', 'income']].values
imputed_knn = knn_imputer.fit_transform(numeric_cols)
# Strategy 3: Categorical imputation with mode
imputer_mode = SimpleImputer(strategy='most_frequent')
data['category_imputed'] = imputer_mode.fit_transform(data[['category']])
print("\nAfter imputation:")
print(data[['age_imputed', 'income_imputed', 'category_imputed']].head())
Encoding Categorical Variables
ML algorithms work with numbers, not strings, so categorical variables must be converted to numerical format. Label Encoding assigns an integer to each category (A=0, B=1, C=2): simple, but it introduces an order that does not exist. One-Hot Encoding creates a binary column for each category: it introduces no order, but it can generate many columns when a categorical has high cardinality. Target Encoding replaces each category with the mean of the target for that category: powerful, but prone to overfitting.
from sklearn.preprocessing import (
    LabelEncoder, OneHotEncoder, OrdinalEncoder,
    StandardScaler, MinMaxScaler, RobustScaler
)
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
# Example dataset
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'M'],
    'price': [10.5, 25.0, 45.0, 80.0, 22.0],
    'weight': [100, 500, 1200, 3000, 450]
})
# One-Hot for color (nominal, no order)
ohe = OneHotEncoder(sparse_output=False, drop='first')
color_encoded = ohe.fit_transform(df[['color']])
print(f"One-Hot color:\n{color_encoded}")
# Ordinal for size (ordinal, has an order)
oe = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
size_encoded = oe.fit_transform(df[['size']])
print(f"\nOrdinal size: {size_encoded.flatten()}")
# --- SCALING ---
# StandardScaler: mean=0, std=1 (good default for roughly Gaussian features)
ss = StandardScaler()
price_standard = ss.fit_transform(df[['price']])
# MinMaxScaler: rescales to [0,1] (preserves shape, but sensitive to outliers)
mms = MinMaxScaler()
price_minmax = mms.fit_transform(df[['price']])
# RobustScaler: uses median and IQR (robust to outliers)
rs = RobustScaler()
weight_robust = rs.fit_transform(df[['weight']])
print(f"\nStandard: {price_standard.flatten().round(2)}")
print(f"MinMax: {price_minmax.flatten().round(2)}")
print(f"Robust: {weight_robust.flatten().round(2)}")
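Target Encoding, mentioned above but not shown in the snippet, can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic dataset; the smoothing strength `m` is an arbitrary choice, and in practice you would compute the encoding out-of-fold (or use a dedicated library) to limit the overfitting risk described earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Rome', 'Milan', 'Rome', 'Turin', 'Milan', 'Rome'],
    'target': [1, 0, 1, 0, 1, 0]
})

# Global target mean, used as a prior for smoothing
global_mean = df['target'].mean()

# Per-category mean and count
stats = df.groupby('city')['target'].agg(['mean', 'count'])

# Smoothed target encoding: rare categories shrink toward the global mean
m = 2  # smoothing strength (arbitrary choice for this sketch)
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['city_encoded'] = df['city'].map(smoothed)
print(df)
```

The smoothing means a category seen only once (like 'Turin' here) is pulled strongly toward the global mean instead of being encoded with an unreliable single-sample average.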
Outlier Detection
Outliers are anomalous values that deviate significantly from the rest of the data. They can be measurement errors, corrupted data, or genuine extreme values. The IQR (Interquartile Range) method identifies outliers as points beyond 1.5 times the IQR from the first or third quartile. The Z-score method identifies points with standardized values beyond a threshold (typically 3). Isolation Forest is an ML approach that isolates outliers using random decision trees.
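The three approaches can be compared on synthetic data; the two injected extreme values below are assumptions made for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 100 normal points plus two injected extreme values
values = np.concatenate([rng.normal(50, 5, 100), [120.0, -30.0]])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score method: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_outliers = np.abs(z) > 3

# Isolation Forest: isolates anomalies with random splits (-1 = outlier)
iso = IsolationForest(contamination=0.02, random_state=42)
iso_labels = iso.fit_predict(values.reshape(-1, 1))

print(f"IQR flagged: {iqr_outliers.sum()}, "
      f"Z-score: {z_outliers.sum()}, "
      f"IsolationForest: {(iso_labels == -1).sum()}")
```

Note that the z-score method computes the mean and standard deviation on data that includes the outliers themselves, which inflates the threshold; this is exactly why RobustScaler and the IQR method, based on quartiles, are preferred when outliers are present.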
Feature Selection
Not all features contribute positively to the model. Irrelevant or redundant features can worsen performance and slow down training. Feature selection identifies the most informative variables. Methods include: correlation (remove highly correlated features), variance threshold (remove low-variance features), SelectKBest (select the K best according to a statistical test), and Random Forest feature importance.
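Two of these methods can be sketched on a deliberately simple synthetic dataset, where one feature drives the target, one is pure noise, and one is constant:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'useful': rng.normal(0, 1, n),
    'noise': rng.normal(0, 1, n),
    'constant': np.ones(n),  # zero variance: carries no information
})
y = (X['useful'] + 0.1 * rng.normal(0, 1, n) > 0).astype(int)

# Variance threshold: drop (near-)constant features
vt = VarianceThreshold(threshold=0.0)
vt.fit(X)
kept = X.columns[vt.get_support()]
print(f"After VarianceThreshold: {list(kept)}")

# SelectKBest: keep the K features most associated with the target (ANOVA F-test)
skb = SelectKBest(score_func=f_classif, k=1)
skb.fit(X[kept], y)
best = kept[skb.get_support()]
print(f"SelectKBest picked: {list(best)}")
```

The variance filter removes 'constant', and the F-test then ranks 'useful' far above 'noise', since the target was built almost entirely from it.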
Preprocessing Pipeline with scikit-learn
scikit-learn Pipelines chain preprocessing and modeling steps into a single object. This prevents data leakage (when test set information contaminates training) and simplifies cross-validation and deployment. ColumnTransformer allows applying different transformations to different columns.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np
# Realistic dataset
np.random.seed(42)
n = 200
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n).astype(float),
    'income': np.random.normal(40000, 15000, n),
    'experience': np.random.randint(0, 30, n).astype(float),
    'city': np.random.choice(['Milan', 'Rome', 'Naples', 'Turin'], n),
    'education': np.random.choice(['Diploma', 'Bachelor', 'Master'], n),
    'target': np.random.randint(0, 2, n)
})
# Insert random missing values
for col in ['age', 'income', 'experience']:
    mask = np.random.random(n) < 0.1
    df.loc[mask, col] = np.nan
# Define columns by type
numeric_features = ['age', 'income', 'experience']
categorical_features = ['city', 'education']
# Numeric preprocessing: impute + scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Categorical preprocessing: impute + one-hot
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])
# ColumnTransformer combines everything
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
# Full pipeline: preprocessing + model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Cross-validation (preprocessing happens INSIDE each fold)
X = df.drop('target', axis=1)
y = df['target']
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
Data Leakage: preprocessing steps (scaling, imputation) must be fitted on the training data only, never on the full dataset. If you fit a scaler on the entire dataset, the test set influences the scaler's parameters. A scikit-learn Pipeline used inside cross-validation prevents this automatically: it calls fit_transform on the training folds and only transform on the held-out fold.
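The same rule, made explicit by hand on synthetic data: the scaler learns its parameters from the training split alone, and the test split only reuses them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(100, 20, size=(200, 1))
y = rng.integers(0, 2, 200)

# Split FIRST, then fit the scaler only on the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit: learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # transform: reuses train statistics

print(f"Mean learned from the training set: {scaler.mean_[0]:.2f}")
```

Fitting the scaler on `X` before the split would let the test rows shift the learned mean and standard deviation, which is exactly the leakage the Pipeline guards against.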
Key Takeaways
- Preprocessing is the most critical phase: 80% of time goes to data preparation
- Missing values: deletion, statistical or predictive imputation depending on context
- Encoding: One-Hot for nominal, Ordinal for ordinal, Target Encoding with caution
- Scaling: StandardScaler for normal distributions, RobustScaler with outliers
- Pipeline + ColumnTransformer prevent data leakage and simplify code
- Feature engineering with domain knowledge often makes more difference than algorithm choice