Day 04 of SQL for Data Engineers: JOINS 201
Part of Week 1: Query Essentials
Welcome to day 4 of the series where we’ll be covering the existence operators that data engineers use every day to answer “does a match exist?” without duplicating rows. We’ll cover EXISTS/NOT EXISTS, show why NOT IN can be dangerous with NULLs, and map these concepts to LEFT SEMI/LEFT ANTI joins (common in Spark). You’ll also use them for “customers without orders,” CDC/new rows, and deduping without window functions.
TL;DR
Semi join (=
EXISTS): return left rows that have at least one match on the right—no duplication.Anti join (=
NOT EXISTS): return left rows that have no match on the right.Avoid
NOT INwhen the subquery may produce NULLs; it can silently return no rows. UseNOT EXISTSinstead.Spark has keywords:
LEFT SEMI JOINandLEFT ANTI JOIN, equivalent toEXISTS/NOT EXISTS.
JOIN Patterns Pt. 2
Semi join with EXISTS - Keep rows that have a match
EXISTS answers “does at least one matching row exist?” It does not duplicate left rows even if many matches exist on the right.
-- Customers who placed orders in July 2025 (at least one)
SELECT c.customer_id, c.email
FROM analytics.customers c
WHERE EXISTS (
SELECT 1
FROM analytics.orders o
WHERE o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'
);Anti join with NOT EXISTS - keep rows with no match
NOT EXISTS returns left rows for which no matching row exists on the right—great for gap analysis.
-- Customers with NO orders in July 2025
SELECT c.customer_id, c.email
FROM analytics.customers c
WHERE NOT EXISTS (
SELECT 1
FROM analytics.orders o
WHERE o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'
);Why NOT IN Coupled with NULLS can Cause Issues
If the subquery behind NOT IN produces a single NULL, SQL’s three-valued logic can make the entire NOT IN predicate unknown, returning zero rows unexpectedly.
-- Risky if subquery can return NULLs
SELECT c.customer_id
FROM analytics.customers c
WHERE c.customer_id NOT IN (
SELECT o.customer_id -- if any NULL here → predicate becomes UNKNOWN
FROM analytics.orders o
);Safe alternatives
Use
NOT EXISTS(preferred), orFilter out NULLs explicitly:
WHERE c.customer_id NOT IN (SELECT o.customer_id FROM ... WHERE o.customer_id IS NOT NULL)
When in doubt, use NOT EXISTS.
Anti join via LEFT JOIN → IS NULL
Same logic as NOT EXISTS, expressed with a join: join and then keep rows where no match occurred.
-- Customers with NO July orders (anti join form)
SELECT c.customer_id, c.email
FROM analytics.customers c
LEFT JOIN analytics.orders o
ON o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'
WHERE o.customer_id IS NULL;Equivalent to NOT EXISTS; pick the style your team prefers.
SparkSQL Keywords - LEFT SEMI/LEFT ANTI
Spark exposes semi/anti joins directly—handy for readability and performance on big datasets.
-- Spark: customers with July orders (semi)
SELECT c.*
FROM analytics.customers c
LEFT SEMI JOIN analytics.orders o
ON o.customer_id = c.customer_id
AND o.order_ts >= DATE '2025-07-01'
AND o.order_ts < DATE '2025-08-01';
-- Spark: customers without July orders (anti)
SELECT c.*
FROM analytics.customers c
LEFT ANTI JOIN analytics.orders o
ON o.customer_id = c.customer_id
AND o.order_ts >= DATE '2025-07-01'
AND o.order_ts < DATE '2025-08-01';CDC/new rows only with NOT EXISTS
Load only records that your target (Gold) doesn’t have yet, idempotent and incremental. A major factor for inline intraday reporting and event based loading.
-- New orders not yet in gold.fct_order
SELECT o.*
FROM analytics.orders o
WHERE NOT EXISTS (
SELECT 1
FROM gold.fct_order f
WHERE f.order_id = o.order_id
);Spice it up with a watermark for time-bounded loads:
AND o.order_ts > (SELECT COALESCE(last_order_ts, TIMESTAMP '1900-01-01') FROM ops.high_watermark)Data Engineering Wisdom
Prefer
EXISTS/NOT EXISTSfor boolean membership tests—stable and optimizer-friendly.For “without” logic, resist
NOT INunless you’ve eliminated NULLs in the subquery.In outer joins, keep right-side filters in
ONwhen you intend to preserve left rows (then useIS NULLfor anti).In CDC jobs, pair
NOT EXISTSwith a watermark for idempotent re-runs.
Mini-exercises (5–10 minutes)
July buyers (semi): Return
customer_idof customers who placed at least one order in July 2025 usingEXISTS.Expected: customers 1 and 2.
No July orders (anti): Return all customers with no July 2025 orders using
NOT EXISTS.Expected: customers 3 and 4.
Ever ordered? (anti join form): Using a
LEFT JOIN … IS NULL, list customers who have never placed any order.Expected: customer 4.
Latest per order_id (no windows): Using the self-anti pattern, keep only the latest row per
order_id.Expected: with the seed set (one row per order), returns all orders unchanged.
Knowledge Check - Day 04 + Day 03 Answers
Answers to yesterdays quiz:
Q1: Choice 2
Q2: Choice 2
Tips for Different Engines
Postgres/DuckDB/SQL Server/SparkSQL:
EXISTS/NOT EXISTSfully supported; optimizers often rewrite these to semi/anti joins automatically.NOT INcaveat: If the subquery yields a NULL, the predicate becomes UNKNOWN → may return zero rows. PreferNOT EXISTS.SparkSQL: has
LEFT SEMIandLEFT ANTIkeywords (sugar forEXISTS/NOT EXISTS).BigQuery/Snowflake: no
SEMI/ANTIkeywords, butEXISTS/NOT EXISTSandLEFT JOIN … IS NULLare optimised similarly.
What’s Next?
Day 05: Aggregations & HAVING
We’ll group rows, compute metrics, handle NULLs carefully, and learn when to filter with HAVING vs WHERE, plus production tips to avoid double counting in aggregated joins.
So see you for the next one and… Happy Coding!


