Day 37 of 50 Days of Python: Using Scikit-Learn for Data Preprocessing and Modelling
Part of Week 6: Advanced Topics
Welcome back to Day 37 of the Python series! Today, we are going to build a full preprocessing → modeling pipeline with everyone’s favourite ML toolkit: scikit‑learn. By the end of this post you’ll know how to:
Clean raw tabular data with SimpleImputer, StandardScaler, and OneHotEncoder
Combine transformers with ColumnTransformer
Chain preprocessing and modeling steps together in a Pipeline
Evaluate with cross‑validation and tune hyper‑parameters with GridSearchCV
Why Preprocessing Matters
Models learn patterns in whatever representation you hand them. Missing values, differing scales, or text labels can throw them off. Scikit‑learn treats every cleaning step as a transformer with a consistent API (fit / transform). That makes it easy to compose rock‑solid, repeatable workflows.
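Here's the transformer API in isolation; a minimal sketch using StandardScaler on a tiny, made-up column of numbers:
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [3.0]])  # tiny made-up numeric column
scaler = StandardScaler()
scaler.fit(toy)                   # learn the column's mean and standard deviation
print(scaler.transform(toy))      # apply the learned scaling
print(scaler.fit_transform(toy))  # or fit and transform in one call
Every transformer in scikit-learn follows this same fit / transform pattern, which is what lets us chain them together later.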
General rule of thumb: if you'll apply the same step to both training data and unseen data, wrap it in a transformer.
Dataset For Today: The Titanic
We'll use Seaborn's built-in Titanic passenger data: perfect for demo-sized experiments, with no web apps or API keys to set up.
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset("titanic")
titanic.head()
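Before imputing anything, it's worth checking where the gaps actually are. A quick null count does the job (on this dataset you'll typically find missing values in age, deck, and embark_town):
titanic.isna().sum().sort_values(ascending=False)  # columns with the most missing values first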
Our target column is survived (1 = yes, 0 = no).
Splitting Features & Target
X = titanic.drop(columns=["survived"])
y = titanic["survived"]
Building the Preprocessing Pipeline
Identify column types
numeric_features = ["age", "fare", "parch", "sibsp"]
categorical_features = ["sex", "class", "embark_town", "alone"]
Define transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
The ColumnTransformer routes each subset of columns through the appropriate mini‑pipeline, then stitches everything back into a single NumPy array.
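If you're curious what the preprocessor actually produces, you can fit it on its own as a quick sanity check (not a required step, just a peek under the hood):
X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)                          # rows x engineered feature columns
print(preprocessor.get_feature_names_out()[:5])  # e.g. "num__age", "cat__sex_female", ...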
Attaching a Model: The Full Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
clf = Pipeline([
("preprocess", preprocessor),
("classifier", LogisticRegression(max_iter=1000)),
])
Now one object owns every step needed to turn raw rows into predictions.
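A handy side effect: because the fitted pipeline is a single Python object, you can persist the whole thing with joblib and reload it later for predictions (the filename below is just an example):
import joblib
# after fitting (next section), save preprocessing + model together
joblib.dump(clf, "titanic_pipeline.joblib")
# ...and load it back whenever you need predictions
loaded_clf = joblib.load("titanic_pipeline.joblib")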
Train / Test Split & Baseline Performance
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test).round(3))
Perform Cross-Validation
cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("CV accuracy:", cv_scores.mean().round(3))
Hyper-Parameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
"classifier__C": [0.01, 0.1, 1, 10],
"classifier__penalty": ["l1", "l2"],
"classifier__solver": ["liblinear"],
}
gs = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
gs.fit(X, y)
print("Best params:", gs.best_params_)
print("Best CV score:", gs.best_score_.round(3))
Notice the double underscore in classifier__C? That's how you reach parameters of individual steps inside a Pipeline: step name, two underscores, then the parameter name.
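Once the search finishes, gs.best_estimator_ is the pipeline refit on all of X and y with the winning parameters, and cv_results_ holds every combination tried. Dumping cv_results_ into a DataFrame is one convenient way to scan it:
best_clf = gs.best_estimator_  # refit pipeline, ready for predictions
results = pd.DataFrame(gs.cv_results_)
print(results[["params", "mean_test_score", "std_test_score"]]
      .sort_values("mean_test_score", ascending=False)
      .head())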
Next Up: Day 38 - Deep Learning Basics with TensorFlow
We'll go over the basics of deep learning using the TensorFlow package and the built-in MNIST dataset to build a digit classifier. I'm going to keep the content short and concise, with usable code.
See you for the next one, and remember… Happy coding!!