Day 37 of 50 Days of Python: Using Scikit-Learn for Data Preprocessing and Modelling
Part of Week 6: Advanced Topics
Welcome back to Day 37 of the Python series! Today, we are going to build a full preprocessing → modeling pipeline with everyone’s favourite ML toolkit: scikit‑learn. By the end of this post you’ll know how to:
Clean raw tabular data with SimpleImputer, StandardScaler, and OneHotEncoder
Combine transformers with ColumnTransformer
Chain preprocessing and modeling steps together in a Pipeline
Evaluate with cross‑validation and tune hyper‑parameters with GridSearchCV
Why Preprocessing Matters
Models learn patterns in whatever representation you hand them. Missing values, differing scales, or text labels can throw them off. Scikit‑learn treats every cleaning step as a transformer with a consistent API (fit / transform). That makes it easy to compose rock‑solid, repeatable workflows.
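Here's the transformer API in isolation; a minimal sketch using StandardScaler on a tiny, made-up column of numbers:
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [3.0]])  # tiny made-up numeric column
scaler = StandardScaler()
scaler.fit(toy)                   # learn the column's mean and standard deviation
print(scaler.transform(toy))      # apply the learned scaling
print(scaler.fit_transform(toy))  # or fit and transform in one call
Every transformer in scikit-learn follows this same fit / transform pattern, which is what lets us chain them together later.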
General rule of thumb: if you'll apply the same step to both training data and unseen data, wrap it in a transformer.
Dataset For Today: The Titanic
We'll use Seaborn's built-in Titanic passenger data: perfect for demo-sized experiments, with no web apps or API keys to set up.
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset("titanic")
titanic.head()
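Before imputing anything, it's worth checking where the gaps actually are. A quick null count does the job (on this dataset you'll typically find missing values in age, deck, and embark_town):
titanic.isna().sum().sort_values(ascending=False)  # columns with the most missing values first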
Our target column is survived (1 = yes, 0 = no).
Splitting Features & Target
X = titanic.drop(columns=["survived"])
y = titanic["survived"]
Building the Preprocessing Pipeline
Identify column types
numeric_features = ["age", "fare", "parch", "sibsp"]
categorical_features = ["sex", "class", "embark_town", "alone"]
Define transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
The ColumnTransformer routes each subset of columns through the appropriate mini‑pipeline, then stitches everything back into a single NumPy array.
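If you're curious what the preprocessor actually produces, you can fit it on its own as a quick sanity check (not a required step, just a peek under the hood):
X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)                          # rows x engineered feature columns
print(preprocessor.get_feature_names_out()[:5])  # e.g. "num__age", "cat__sex_female", ...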
Attaching a Model: The Full Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
clf = Pipeline([
("preprocess", preprocessor),
("classifier", LogisticRegression(max_iter=1000)),
])
Now one object owns every step needed to turn raw rows into predictions.
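A handy side effect: because the fitted pipeline is a single Python object, you can persist the whole thing with joblib and reload it later for predictions (the filename below is just an example):
import joblib
# after fitting (next section), save preprocessing + model together
joblib.dump(clf, "titanic_pipeline.joblib")
# ...and load it back whenever you need predictions
loaded_clf = joblib.load("titanic_pipeline.joblib")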
Train / Test Split & Baseline Performance
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test).round(3))
Perform Cross-Validation
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("CV accuracy:", cv_scores.mean().round(3))
Hyper-Parameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
"classifier__C": [0.01, 0.1, 1, 10],
"classifier__penalty": ["l1", "l2"],
"classifier__solver": ["liblinear"],
}
gs = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
gs.fit(X, y)
print("Best params:", gs.best_params_)
print("Best CV score:", gs.best_score_.round(3))
Notice the double underscore in classifier__C? That's how you reach parameters of individual steps inside a Pipeline: step name, two underscores, then the parameter name.
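Once the search finishes, gs.best_estimator_ is the pipeline refit on all of X and y with the winning parameters, and cv_results_ holds every combination tried. Dumping cv_results_ into a DataFrame is one convenient way to scan it:
best_clf = gs.best_estimator_  # refit pipeline, ready for predictions
results = pd.DataFrame(gs.cv_results_)
print(results[["params", "mean_test_score", "std_test_score"]]
      .sort_values("mean_test_score", ascending=False)
      .head())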
Next Up: Day 38 - Deep Learning Basics with TensorFlow
We'll go over the basics of deep learning using the TensorFlow package and the built-in MNIST dataset to build a digit classifier. I'm going to keep the content short and concise, with usable code.
See you for the next one, and remember… Happy coding!!