The Ultimate Guide to Testing in Data Engineering: When, How, and Why It Matters
If you don't test your pipelines, your users will.
In software engineering, testing is a default expectation. In data engineering, it's often overlooked until something inevitably breaks. The consequences of a bad data pipeline can be silent, and they often lead to angry stakeholders:
Broken dashboards misleading leadership.
Incorrect models influencing critical decisions.
Data breaches or leakage of PII.
Testing adds guardrails, ensuring transformations, pipelines, and infrastructure are reliable, scalable, and correct. As modern data stacks become more complex with real-time streams, feature stores, and ML integration, testing is no longer “optional”.
When Should You Test?
Testing isn’t something you "add later." Although technically you can, it should be built in from Day 1. Here's what a typical lifecycle looks like, stage by stage, with examples of the tests you should be running at each one, so you can get more of a feel for it:
Contract & Schema Testing - Design Phase
Validate schema shape, types, nullability and the presence of key fields.
Useful for batch ingestion jobs, streaming and APIs.
Expect:

```json
{ "user_id": "int", "event_type": "string", "timestamp": "datetime" }
```

Fail if:
- `user_id` is null
- an unexpected column appears (e.g., `promo_code`)
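As a rough sketch of what that looks like in practice, here's a plain-Python version of the check. The `EXPECTED_SCHEMA` dict and `validate_record` helper are illustrative names, not from any particular framework:

```python
# Illustrative contract check -- not tied to any framework.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event_type": str,
    "timestamp": str,  # parsed to a datetime further downstream
}

def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if record.get(field) is None:
            errors.append(f"missing or null field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    # Unexpected columns fail the contract too (e.g., promo_code sneaking in)
    for field in record:
        if field not in EXPECTED_SCHEMA:
            errors.append(f"unexpected field: {field}")
    return errors

assert validate_record({"user_id": 1, "event_type": "click", "timestamp": "2024-01-01T00:00:00"}) == []
```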
Unit Tests - Development Phase
Simulate edge cases using mock data.
Test single notebook functions, SQL CTEs, or transformation blocks.
```python
def test_null_cleanup():
    # Two users: one with a missing country, one with a valid one
    df = spark.createDataFrame(
        [("user1", None), ("user2", "UK")],
        ["user_id", "country"],
    )
    result = fill_null_country(df)
    # collect() returns Row objects, so pull the values out before asserting
    countries = [row["country"] for row in result.select("country").distinct().collect()]
    assert "Unknown" in countries
```
Integration Tests - Development Phase
Validate joins, aggregates and filters across datasets.
Run these against dev/test data.
```sql
-- Expect every purchase to map to a known user
SELECT COUNT(*)
FROM gold.purchases p
LEFT JOIN silver.users u
  ON p.user_id = u.user_id
WHERE u.user_id IS NULL;
```
Infrastructure / CI Tests - Deployment Phase
Validate Key Vaults, storage mounts and workspace creation.
```bash
az storage container show --name raw-data --account-name mystorage
```
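To make that an actual CI gate rather than a manual command, you can wrap the CLI call in a test. A minimal pytest sketch, assuming the `az` CLI is installed and authenticated on the build agent:

```python
import subprocess

def test_raw_data_container_exists():
    # `az storage container show` exits non-zero when the container
    # is missing, so this fails the CI run before anything deploys.
    result = subprocess.run(
        ["az", "storage", "container", "show",
         "--name", "raw-data", "--account-name", "mystorage"],
        capture_output=True,
    )
    assert result.returncode == 0, result.stderr.decode()
```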
Data Validation Tests - Post-Deployment Phase
Row counts, freshness, deduplication, null/empty checks.
```sql
SELECT COUNT(*)
FROM bronze.events
WHERE _ingest_timestamp >= CURRENT_DATE()
```
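The SQL above covers freshness and row counts; here's a sketch of the deduplication and null checks from the same list in PySpark. It assumes an active `spark` session and that the table has an `event_id` column:

```python
from pyspark.sql import functions as F

events = spark.table("bronze.events")  # assumes an active spark session

# Deduplication: no event_id should appear more than once
# (event_id is an assumed column name here)
dupes = events.groupBy("event_id").count().filter(F.col("count") > 1)
dupe_count = dupes.count()
assert dupe_count == 0, f"{dupe_count} duplicated event_ids"

# Null/empty check on a required column
null_users = events.filter(F.col("user_id").isNull()).count()
assert null_users == 0, f"{null_users} rows with null user_id"
```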
Anomaly & Regression Tests - Production Phase
Detect schema drift, volume spikes or model drift.
```sql
-- Trigger an alert if daily sales land more than 3 standard
-- deviations above the 7-day rolling average
WITH stats AS (
  SELECT
    AVG(daily_sales) AS avg_sales,
    STDDEV(daily_sales) AS std_dev
  FROM metrics.sales_rolling_7
)
SELECT d.*
FROM daily_sales d
CROSS JOIN stats s
WHERE d.sales_amount > s.avg_sales + 3 * s.std_dev
```
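Volume spikes are covered by the query above. For schema drift, one lightweight approach is to diff the live schema against a stored baseline. A sketch, assuming a `spark` session and a hand-maintained `baseline` dict:

```python
# Hand-maintained baseline of column name -> Spark type
baseline = {"user_id": "bigint", "event_type": "string", "timestamp": "timestamp"}

current = {
    f.name: f.dataType.simpleString()
    for f in spark.table("bronze.events").schema.fields
}

added   = current.keys() - baseline.keys()
dropped = baseline.keys() - current.keys()
retyped = {c for c in baseline.keys() & current.keys() if baseline[c] != current[c]}

if added or dropped or retyped:
    raise ValueError(f"Schema drift detected: +{added} -{dropped} ~{retyped}")
```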
Tools and Frameworks to Know
Knowing these testing examples and where to apply them is helpful. But what's more helpful is knowing which tools and frameworks to implement in each area, so I've put together this table to map that out. If you have any questions about it, let me know in the comments and I'd do my best to address them.
Best Practices, Tips & Gotchas
Test small → then scale
Start with the highest-risk logic: joins, filters, and aggregates. You don't need to test everything, just what can break silently and cause headaches.
Shift left
Catch issues early in dev with unit and contract testing. It saves you hours of debugging production pipelines.
Automate tests
Add your tests to Git-based CI pipelines, especially for notebooks, SQL models, or IaC (infrastructure-as-code).
Alert on failures
Testing without monitoring is only half the picture. Pair your validation logic with real-time alerts.
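As a minimal illustration, a failed check can post straight to a chat webhook (Slack/Teams-style). The `webhook_url` is a placeholder; in practice your orchestrator's alerting will likely do this for you:

```python
import json
import urllib.request

def alert_on_failure(check_name: str, passed: bool, webhook_url: str) -> None:
    """Post to a chat webhook when a validation check fails."""
    if passed:
        return
    payload = json.dumps({"text": f"Data test failed: {check_name}"}).encode()
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```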
Bundle synthetic data
Mock edge cases using small datasets and reuse them across pipelines. It's your safety net for logic tests: easy to deploy and easy to update.
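A sketch of what that bundling can look like with pytest: one tiny, hand-written dataset shared across tests. This assumes a `spark` session fixture is defined in your conftest.py:

```python
import pytest

@pytest.fixture(scope="session")
def sample_events(spark):
    # One tiny, hand-written dataset reused by every transformation test
    return spark.createDataFrame(
        [
            ("user1", "click", None),     # null country -> exercises cleanup
            ("user2", "purchase", "UK"),  # happy path
            ("user2", "purchase", "UK"),  # duplicate -> exercises dedup
        ],
        ["user_id", "event_type", "country"],
    )
```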
It Can Be Tough Changing the Status Quo
Although all this is awesome information, and I do implore all data professionals to stick to it, I can completely understand that some workplaces act like the laces on their own shoes and trip themselves up when it comes to implementing proper testing in data pipelines. There are numerous reasons why:
Management not understanding the reasoning for testing and thinking it's "dead" work.
Pressure from stakeholders above to get work completed as quickly as possible.
Lack of knowledge of how and when to implement testing across pipelines.
The truth is, data professionals at all levels need to understand that tests aren't optional. They're there to make sure the dashboards BI teams build for stakeholders, and the models ML engineers develop, don't fall over and start serving bogus recommendations. Let's hope we can start making testing the norm and save us ALL the headaches!