Day 07 of SQL for Data Engineers: Text Cleaning & Regex
Part of Week 1: Query Essentials
Today we’ll cover making messy strings usable. You’ll standardise case and whitespace, derive fields (like email domains), validate text with patterns/regex, and learn where text cleaning belongs in a medallion flow so your queries stay fast and reliable.
TL;DR
- Clean at the edge (Bronze) so downstream (Silver/Gold) stays light.
- Normalise whitespace & case (TRIM, LOWER/UPPER) before comparisons.
- Use simple string ops first, then regex only where needed.
- Regex on big tables is expensive: anchor patterns, prefilter, and persist results.
The Patterns
Normalise whitespace & case (the 80/20)
Data often arrives with extra spaces or mixed case. Normalising improves joins, dedupe, and filters.
-- Canonical email & name fields
SELECT
customer_id,
LOWER(TRIM(email)) AS email_norm,
INITCAP(TRIM(first_name)) AS first_name_norm, -- DuckDB/Postgres/Spark: INITCAP
INITCAP(TRIM(last_name)) AS last_name_norm
FROM analytics.customers;

Do this once in Silver and reuse; don’t repeat TRIM/LOWER in every query.
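To make that concrete, here’s a minimal sketch of persisting the normalised fields once in a Silver table. The silver.customers name is illustrative, and CREATE OR REPLACE TABLE is DuckDB/Spark-style syntax; in Postgres you’d use CREATE TABLE ... AS or a view instead.

-- Persist normalised fields once in Silver, then reuse everywhere
CREATE OR REPLACE TABLE silver.customers AS
SELECT
    customer_id,
    LOWER(TRIM(email))        AS email_norm,
    INITCAP(TRIM(first_name)) AS first_name_norm,
    INITCAP(TRIM(last_name))  AS last_name_norm
FROM analytics.customers;

-- Downstream queries read the clean columns directly
SELECT customer_id FROM silver.customers WHERE email_norm = 'jon@x.com';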
Safer comparisons (avoid false mismatches)
Comparing raw strings leads to misses (“ JON@X.COM ” vs “jon@x.com”). Compare on normalised fields.
-- Find duplicates by normalised email
WITH c AS (
SELECT customer_id, LOWER(TRIM(email)) AS email_norm
FROM analytics.customers
)
SELECT email_norm, COUNT(*) AS cnt
FROM c
WHERE email_norm IS NOT NULL AND email_norm <> ''
GROUP BY email_norm
HAVING COUNT(*) > 1;

Derive useful parts (split & extract)
Turn a free form string into structured fields you can group on.
-- Email domain derivation (portable form)
SELECT
customer_id,
LOWER(TRIM(email)) AS email_norm,
SUBSTRING(LOWER(TRIM(email)) FROM POSITION('@' IN LOWER(TRIM(email))) + 1) AS email_domain
FROM analytics.customers
WHERE email IS NOT NULL AND TRIM(email) <> '';

Validate formats with regex
Regex can flag obviously bad values. Keep patterns simple and anchored (^…$) and run on filtered subsets to control cost.
-- “Looks like an email” (not perfect, but practical)
-- Postgres/DuckDB: ~ is regex; ~* is case-insensitive
SELECT customer_id, email
FROM analytics.customers
WHERE LOWER(TRIM(email)) ~ '^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$';
-- Spark: rlike
-- WHERE lower(trim(email)) rlike '^[a-z0-9._%+\\-]+@[a-z0-9.\\-]+\\.[a-z]{2,}$'

Clean numbers inside strings (e.g., phone)
Strip non-digits to standardise before validation or formatting.
-- Keep digits only → ' +44 (0) 7912-345-678 ' → '4407912345678'
-- Postgres/DuckDB ('g' = replace all matches):
SELECT REGEXP_REPLACE(phone_raw, '[^0-9]', '', 'g') AS phone_digits
FROM some_table;
-- Spark replaces all matches by default: regexp_replace(phone_raw, '[^0-9]', '')

Extract with regex groups (targeted parsing)
Use capture groups to pull a specific piece (e.g., a UTM parameter, a code).
-- Extract the 2-letter country code at the end of site, e.g. 'jdsportsuk' -> 'uk'
-- Spark: regexp_extract(site, '.*?([a-z]{2})$', 1)
-- Postgres/DuckDB:
SELECT
site,
REGEXP_REPLACE(site, '.*?([a-z]{2})$', '\1') AS site_cc
FROM analytics.orders;

Where this lives in Medallion
- Bronze: store raw; add minimal metadata (ingest_date, source_file).
- Silver: normalise (TRIM/LOWER), derive fields (email_domain), and persist them.
- Gold: aggregate/serve; avoid heavy string work here.
Push heavy text ops left (Bronze→Silver); Gold should be reading, not scrubbing.
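As a quick contrast, here’s a hedged sketch of what a Gold-layer query looks like once Silver has persisted the derived columns (silver.customers and email_domain are the illustrative names from the examples above):

-- Gold reads persisted columns; no TRIM/LOWER/regex at query time
SELECT email_domain, COUNT(*) AS customer_count
FROM silver.customers
GROUP BY email_domain
ORDER BY customer_count DESC;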
Data Engineering Wisdom
- Add a DQ rule for text fields (e.g., “emails must match pattern OR be blank”). Record failures in a quarantine table (see the sketch after this list).
- Create materialised columns (email_norm, email_domain) so joins/grouping don’t recompute functions.
- On big tables, prefilter (WHERE email LIKE '%@%') before regex; measure with EXPLAIN/ANALYZE.
- Keep patterns anchored (^…$) and documented in a reference file.
- Avoid over-validating: aim for “useful, not perfect” (true validation often belongs in the app layer).
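Here’s a minimal sketch of that quarantine pattern, reusing the anchored email regex from earlier. The quarantine_emails table and Postgres/DuckDB-style syntax are assumptions for illustration:

-- Quarantine non-blank emails that fail the anchored pattern
INSERT INTO quarantine_emails (customer_id, email, failed_at)
SELECT customer_id, email, CURRENT_TIMESTAMP
FROM analytics.customers
WHERE email IS NOT NULL
  AND TRIM(email) <> ''
  AND NOT (LOWER(TRIM(email)) ~ '^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$');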
Knowledge Check - Day 07 + Day 06 Answers
Answers to yesterday’s quiz:
Q1: Choice 2
Q2: Choice 1
Tips for Different Engines
- Postgres: LOWER, TRIM, split_part, regexp_replace, regex operators ~ / ~*.
- DuckDB: supports lower, trim, regexp_replace, regexp_matches; initcap available.
- Spark SQL: lower, trim, initcap, split, regexp_replace, regexp_extract, rlike.
- SQL Server: no native regex; use LIKE, PATINDEX, TRANSLATE/REPLACE, or perform regex upstream (e.g., Spark, Python) before persisting to SQL Server (see the sketch after this list).
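For SQL Server, here’s a hedged sketch of the earlier digit-stripping step without regex, using a PATINDEX loop; the standalone variable batch is illustrative, and in practice you’d wrap it in a scalar function or do the cleanup upstream:

-- T-SQL: remove non-digit characters one at a time (no native regex)
DECLARE @s varchar(50) = ' +44 (0) 7912-345-678 ';
WHILE PATINDEX('%[^0-9]%', @s) > 0
    SET @s = STUFF(@s, PATINDEX('%[^0-9]%', @s), 1, '');
SELECT @s AS phone_digits;  -- '4407912345678'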
What’s Next?
Day 08: Dates, Time Zones & Watermarks
We’ll be moving into Week 2: Physical Design & Storage, where we’ll start by covering how to master date math and truncation, avoid daylight-saving traps, and set up watermarking windows that power your incremental loads.
Thanks for following along and I hope you enjoyed week 1 and the new format of the series. See you next week and… Happy Coding!


