Day 44 of 50 Days of Python: Writing Unit & Integration Tests in Python
Part of Week 7: Python In Production
Yesterday we tamed our logs; today we climb the testing pyramid to ensure our code runs correctly after every commit. Unit tests validate individual functions in microseconds, while integration tests confirm that real databases, APIs, and message queues play nicely together. Together they deliver early‑warning radar for regressions, unlock fearless refactoring, and strengthen every stage of your CI/CD pipeline.
What We’ll Cover
Difference between unit, integration, and end‑to‑end (E2E) tests.
Organising tests with pytest and the tests/ folder pattern.
Using fixtures, parametrisation, and markers for clean, DRY tests.
Mocking I/O with unittest.mock & pytest‑mock.
Running integration tests against real dependencies: a DuckDB warehouse end to end, plus a Postgres connection fixture for Docker Compose setups.
Key Concepts
→ Test discovery – tests/**/test_*.py: where pytest automatically finds test files.
→ Markers – @pytest.mark.unit, @pytest.mark.integration: label and select subsets with -m.
→ Coverage threshold – --cov-fail-under=90: fail CI if coverage drops below 90 %.
→ Fixture scope – function, module, session: control resource lifetime (e.g., DB, API client).
→ Parametrisation – @pytest.mark.parametrize: run the same test with many inputs effortlessly (demonstrated in the sketch after this list).
→ Xfail/Skip – @pytest.mark.xfail, skipif: document expected failures or conditional skips.
→ Monkeypatch – monkeypatch.setenv(): temporarily modify env vars or attributes during a test.
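To make the last three concepts concrete, here is a minimal, self-contained sketch (the helper and test names are invented for illustration) showing parametrisation, a conditional skip, and monkeypatching an environment variable:
import os
import sys
import pytest

def normalise_currency(value: str) -> float:
    # toy helper: strip a leading "$" and return a float
    return float(value.lstrip("$"))

@pytest.mark.parametrize("raw, expected", [
    ("$100", 100.0),
    ("250", 250.0),
    ("$0", 0.0),
])
def test_normalise_currency(raw, expected):
    assert normalise_currency(raw) == expected

@pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only path handling")
def test_posix_path_separator():
    assert os.sep == "/"

def test_reads_env_var(monkeypatch):
    monkeypatch.setenv("DATA_DIR", "/tmp/data")  # undone automatically after the test
    assert os.environ["DATA_DIR"] == "/tmp/data"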
Hands‑On: Unit Tests for Data-Engineering Functions
Data engineers spend most of their time massaging DataFrames. Let’s test a couple of transformation helpers that power yesterday’s pipelines.
import pandas as pd
# toy helper – a price-per-sqft proxy (rooms per household stands in for square footage)
def add_price_per_sqft(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["price_per_sqft"] = df["median_house_value"] / (df["total_rooms"] / df["households"])
    return df

# normalise the categorical column into a smaller label set
def normalise_ocean_proximity(df: pd.DataFrame) -> pd.DataFrame:
    mapping = {"<1H OCEAN": "NEAR", "INLAND": "FAR", "ISLAND": "ISLAND"}
    return df.assign(ocean_proximity=df["ocean_proximity"].map(mapping))
Unit Tests:
import pandas as pd
import pytest
from src.transformations import add_price_per_sqft, normalise_ocean_proximity
@pytest.fixture
def raw_df():
    return pd.DataFrame({
        "median_house_value": [300_000, 120_000],
        "total_rooms": [6, 4],
        "households": [2, 1],
        "ocean_proximity": ["<1H OCEAN", "INLAND"],
    })

@pytest.mark.unit
def test_add_price_per_sqft(raw_df):
    out = add_price_per_sqft(raw_df)
    assert "price_per_sqft" in out.columns
    assert out["price_per_sqft"].between(0, 1_000_000).all()  # sanity-check

@pytest.mark.unit
def test_normalise_ocean_proximity(raw_df):
    out = normalise_ocean_proximity(raw_df)
    assert set(out["ocean_proximity"]).issubset({"NEAR", "FAR", "ISLAND"})
pytest -q -m unit --cov=src.transformations
Mocking External Calls
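Unit tests should never hit the network or a live database. Below is a minimal sketch using unittest.mock.patch (pytest-mock's mocker.patch works the same way); the src.extract module and its fetch_housing_api helper are hypothetical names chosen for illustration, assumed to wrap requests.get(...).json():
import pytest
from unittest.mock import MagicMock, patch

from src.extract import fetch_housing_api  # hypothetical helper wrapping requests.get(...).json()

@pytest.mark.unit
def test_fetch_housing_api_without_network():
    fake_response = MagicMock()
    fake_response.json.return_value = [{"median_house_value": 300_000}]
    # patch requests.get where it is *used*, so no real HTTP call is made
    with patch("src.extract.requests.get", return_value=fake_response) as mock_get:
        rows = fetch_housing_api("https://example.com/housing")
    mock_get.assert_called_once()
    assert rows[0]["median_house_value"] == 300_000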
Integration Test: End‑to‑End ETL Pipeline
Suppose you have an ETL script that reads a raw CSV, applies the transformations, and loads the result into DuckDB inside a Docker container. You want to guarantee that a full run completes and populates the warehouse with the expected record count.
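The ETL script itself isn't shown in this walkthrough, but a minimal sketch of what src/etl.py could look like, reusing the transformation helpers from above (the calihouse table name and CLI arguments simply mirror the test below), might be:
# src/etl.py (sketch) – extract a raw CSV, transform it, load it into DuckDB
import sys

import duckdb
import pandas as pd

from src.transformations import add_price_per_sqft, normalise_ocean_proximity

def main(raw_csv: str, warehouse: str) -> None:
    df = pd.read_csv(raw_csv)                                # extract
    df = normalise_ocean_proximity(add_price_per_sqft(df))   # transform
    con = duckdb.connect(warehouse)                          # load
    con.execute("CREATE OR REPLACE TABLE calihouse AS SELECT * FROM df")
    con.close()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])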
import subprocess

import duckdb
import pytest

@pytest.mark.integration
def test_etl_pipeline(tmp_path):
    raw_csv = tmp_path / "housing_raw.csv"
    warehouse = tmp_path / "warehouse.duckdb"  # fresh warehouse for every test run
    # write a tiny two-row sample
    raw_csv.write_text(
        "median_house_value,total_rooms,households,ocean_proximity\n"
        "300000,6,2,<1H OCEAN\n"
        "120000,4,1,INLAND\n"
    )
    # run the ETL script (could be Python, an Airflow DAG run, etc.)
    exit_code = subprocess.call(["python", "src/etl.py", str(raw_csv), str(warehouse)])
    assert exit_code == 0
    con = duckdb.connect(str(warehouse))
    rows = con.execute("SELECT count(*) FROM calihouse").fetchone()[0]
    assert rows == 2
pytest -m integration -q
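The hands-on example above uses DuckDB, but the same marker-and-fixture pattern scales to heavier dependencies such as Postgres running under Docker Compose. A minimal sketch, assuming the database service is already up (docker compose up -d) and psycopg2 is installed; the host, credentials, and environment-variable defaults are illustrative:
import os

import psycopg2
import pytest

@pytest.fixture(scope="session")
def pg_conn():
    # one connection shared by every integration test in the session
    conn = psycopg2.connect(
        host=os.getenv("PGHOST", "localhost"),
        port=int(os.getenv("PGPORT", "5432")),
        dbname=os.getenv("PGDATABASE", "warehouse"),
        user=os.getenv("PGUSER", "postgres"),
        password=os.getenv("PGPASSWORD", "postgres"),
    )
    yield conn
    conn.close()  # torn down once, after the last test finishes

@pytest.mark.integration
def test_postgres_is_reachable(pg_conn):
    with pg_conn.cursor() as cur:
        cur.execute("SELECT 1")
        assert cur.fetchone() == (1,)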
TL;DR
Unit tests isolate single functions with fast execution & mocking.
Integration tests spin up real dependencies (API, DB) to verify contracts.
Pytest fixtures, parametrisation, and markers keep suites clean and expressive.
Run pytest -m "unit or integration" --cov inside CI to guard every commit.
Next Up: Day 45 - Code Optimisation and Profiling Tools.
I’ll take you through optimisation strategies and profiling techniques, which are the backbone of production‑ready pipelines and code.
See you for the next one and, as always… Happy coding!