Day 07 of SQL for Data Engineers: Text Cleaning & Regex
Part of Week 1: Query Essentials
Today we’ll cover making messy strings usable. You’ll standardise case and whitespace, derive fields (like email domains), validate text with patterns/regex, and learn where text cleaning belongs in a medallion flow so your queries stay fast and reliable.
TL;DR
- Clean at the edge (Bronze) so downstream (Silver/Gold) stays light.
- Normalise whitespace & case (TRIM, LOWER/UPPER) before comparisons.
- Use simple string ops first, then regex only where needed.
- Regex on big tables is expensive: anchor patterns, prefilter, and persist results.
The Patterns
Normalise whitespace & case (the 80/20)
Data often arrives with extra spaces or mixed case. Normalising improves joins, dedupe, and filters.
-- Canonical email & name fields
SELECT
customer_id,
LOWER(TRIM(email)) AS email_norm,
INITCAP(TRIM(first_name)) AS first_name_norm, -- DuckDB/Postgres/Spark: INITCAP
INITCAP(TRIM(last_name)) AS last_name_norm
FROM analytics.customers;

Do this once in Silver and reuse; don’t repeat TRIM/LOWER in every query.
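To make that concrete, here’s a minimal sketch of persisting the normalised fields once in a Silver table. The silver.customers name is illustrative, and CREATE OR REPLACE TABLE is DuckDB/Spark-style syntax; in Postgres you’d use CREATE TABLE ... AS or a view instead.

-- Persist normalised fields once in Silver, then reuse everywhere
CREATE OR REPLACE TABLE silver.customers AS
SELECT
    customer_id,
    LOWER(TRIM(email))        AS email_norm,
    INITCAP(TRIM(first_name)) AS first_name_norm,
    INITCAP(TRIM(last_name))  AS last_name_norm
FROM analytics.customers;

-- Downstream queries read the clean columns directly
SELECT customer_id FROM silver.customers WHERE email_norm = 'jon@x.com';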
Safer comparisons (avoid false mismatches)
Comparing raw strings leads to misses (“ JON@X.COM ” vs “jon@x.com”). Compare on normalised fields.
-- Find duplicates by normalised email
WITH c AS (
SELECT customer_id, LOWER(TRIM(email)) AS email_norm
FROM analytics.customers
)
SELECT email_norm, COUNT(*) AS cnt
FROM c
WHERE email_norm IS NOT NULL AND email_norm <> ''
GROUP BY email_norm
HAVING COUNT(*) > 1;

Derive useful parts (split & extract)
Turn a free form string into structured fields you can group on.
-- Email domain derivation (portable form)
SELECT
customer_id,
LOWER(TRIM(email)) AS email_norm,
SUBSTRING(LOWER(TRIM(email)) FROM POSITION('@' IN LOWER(TRIM(email))) + 1) AS email_domain
FROM analytics.customers
WHERE email IS NOT NULL AND TRIM(email) <> '';

Validate formats with regex
Regex can flag obviously bad values. Keep patterns simple and anchored (^…$) and run on filtered subsets to control cost.
-- “Looks like an email” (not perfect, but practical)
-- Postgres/DuckDB: ~ is regex; ~* is case-insensitive
SELECT customer_id, email
FROM analytics.customers
WHERE LOWER(TRIM(email)) ~ '^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$';
-- Spark: rlike
-- WHERE lower(trim(email)) rlike '^[a-z0-9._%+\\-]+@[a-z0-9.\\-]+\\.[a-z]{2,}$'

Clean numbers inside strings (e.g., phone)
Strip non-digits to standardise before validation or formatting.
-- Keep digits only → ' +44 (0) 7912-345-678 ' → '4407912345678'
-- Postgres/DuckDB ('g' = replace all matches):
SELECT REGEXP_REPLACE(phone_raw, '[^0-9]', '', 'g') AS phone_digits
FROM some_table;
-- Spark replaces all matches by default: regexp_replace(phone_raw, '[^0-9]', '')

Extract with regex groups (targeted parsing)
Use capture groups to pull a specific piece (e.g., a UTM parameter, a code).
-- Extract the 2-letter country code at the end of site, e.g. 'jdsportsuk' -> 'uk'
-- Spark: regexp_extract(site, '.*?([a-z]{2})$', 1)
-- Postgres/DuckDB:
SELECT
site,
REGEXP_REPLACE(site, '.*?([a-z]{2})$', '\1') AS site_cc
FROM analytics.orders;

Where this lives in Medallion
- Bronze: store raw; add minimal metadata (ingest_date, source_file).
- Silver: normalise (TRIM/LOWER), derive fields (email_domain), and persist them.
- Gold: aggregate/serve; avoid heavy string work here.
Push heavy text ops left (Bronze→Silver); Gold should be reading, not scrubbing.
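As a quick contrast, here’s a hedged sketch of what a Gold-layer query looks like once Silver has persisted the derived columns (silver.customers and email_domain are the illustrative names from the examples above):

-- Gold reads persisted columns; no TRIM/LOWER/regex at query time
SELECT email_domain, COUNT(*) AS customer_count
FROM silver.customers
GROUP BY email_domain
ORDER BY customer_count DESC;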
Data Engineering Wisdom
- Add a DQ rule for text fields (e.g., “emails must match pattern OR be blank”). Record failures in a quarantine table (see the sketch after this list).
- Create materialised columns (email_norm, email_domain) so joins/grouping don’t recompute functions.
- On big tables, prefilter (WHERE email LIKE '%@%') before regex; measure with EXPLAIN/ANALYZE.
- Keep patterns anchored (^…$) and documented in a reference file.
- Avoid over-validating: aim for “useful, not perfect” (true validation often belongs in the app layer).
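Here’s a minimal sketch of that quarantine pattern, reusing the anchored email regex from earlier. The quarantine_emails table and Postgres/DuckDB-style syntax are assumptions for illustration:

-- Quarantine non-blank emails that fail the anchored pattern
INSERT INTO quarantine_emails (customer_id, email, failed_at)
SELECT customer_id, email, CURRENT_TIMESTAMP
FROM analytics.customers
WHERE email IS NOT NULL
  AND TRIM(email) <> ''
  AND NOT (LOWER(TRIM(email)) ~ '^[a-z0-9._%+\-]+@[a-z0-9.\-]+\.[a-z]{2,}$');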
Knowledge Check - Day 07 + Day 06 Answers
Answers to yesterday’s quiz:
Q1: Choice 2
Q2: Choice 1
Tips for Different Engines
- Postgres: LOWER, TRIM, split_part, regexp_replace, regex operators ~ / ~*.
- DuckDB: supports lower, trim, regexp_replace, regexp_matches; initcap available.
- Spark SQL: lower, trim, initcap, split, regexp_replace, regexp_extract, rlike.
- SQL Server: no native regex; use LIKE, PATINDEX, TRANSLATE/REPLACE, or perform regex upstream (e.g., Spark, Python) before persisting to SQL Server (see the sketch after this list).
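For SQL Server, here’s a hedged sketch of the earlier digit-stripping step without regex, using a PATINDEX loop; the standalone variable batch is illustrative, and in practice you’d wrap it in a scalar function or do the cleanup upstream:

-- T-SQL: remove non-digit characters one at a time (no native regex)
DECLARE @s varchar(50) = ' +44 (0) 7912-345-678 ';
WHILE PATINDEX('%[^0-9]%', @s) > 0
    SET @s = STUFF(@s, PATINDEX('%[^0-9]%', @s), 1, '');
SELECT @s AS phone_digits;  -- '4407912345678'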
What’s Next?
Day 08: Dates, Time Zones & Watermarks
We’ll be moving into Week 2: Physical Design & Storage, where we’ll start by covering how to master date math and truncation, avoid daylight-saving traps, and set up watermarking windows that power your incremental loads.
Thanks for following along and I hope you enjoyed week 1 and the new format of the series. See you next week and… Happy Coding!


