Day 04 of SQL for Data Engineers: JOINS 201

Part of Week 1: Query Essentials

Sep 04, 2025

Welcome to day 4 of the series where we’ll be covering the existence operators that data engineers use every day to answer “does a match exist?” without duplicating rows. We’ll cover EXISTS/NOT EXISTS, show why NOT IN can be dangerous with NULLs, and map these concepts to LEFT SEMI/LEFT ANTI joins (common in Spark). You’ll also use them for “customers without orders,” CDC/new rows, and deduping without window functions.

TL;DR

Semi join (= EXISTS): return left rows that have at least one match on the right—no duplication.
Anti join (= NOT EXISTS): return left rows that have no match on the right.
Avoid NOT IN when the subquery may produce NULLs; it can silently return no rows. Use NOT EXISTS instead.
Spark has keywords: LEFT SEMI JOIN and LEFT ANTI JOIN, equivalent to EXISTS / NOT EXISTS.

JOIN Patterns Pt. 2

Semi join with EXISTS - Keep rows that have a match

EXISTS answers “does at least one matching row exist?” It does not duplicate left rows even if many matches exist on the right.

-- Customers who placed orders in July 2025 (at least one)
SELECT c.customer_id, c.email
FROM analytics.customers c
WHERE EXISTS (
  SELECT 1
  FROM analytics.orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_ts >= TIMESTAMP '2025-07-01'
    AND o.order_ts <  TIMESTAMP '2025-08-01'
);

Anti join with NOT EXISTS - keep rows with no match

NOT EXISTS returns left rows for which no matching row exists on the right—great for gap analysis.

-- Customers with NO orders in July 2025
SELECT c.customer_id, c.email
FROM analytics.customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM analytics.orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_ts >= TIMESTAMP '2025-07-01'
    AND o.order_ts <  TIMESTAMP '2025-08-01'
);

Why NOT IN Coupled with NULLS can Cause Issues

If the subquery behind NOT IN produces a single NULL, SQL’s three-valued logic can make the entire NOT IN predicate unknown, returning zero rows unexpectedly.

-- Risky if subquery can return NULLs
SELECT c.customer_id
FROM analytics.customers c
WHERE c.customer_id NOT IN (
  SELECT o.customer_id    -- if any NULL here → predicate becomes UNKNOWN
  FROM analytics.orders o
);

Safe alternatives

Use NOT EXISTS (preferred), or
Filter out NULLs explicitly: WHERE c.customer_id NOT IN (SELECT o.customer_id FROM ... WHERE o.customer_id IS NOT NULL)

When in doubt, use NOT EXISTS.

Anti join via LEFT JOIN → IS NULL

Same logic as NOT EXISTS, expressed with a join: join and then keep rows where no match occurred.

-- Customers with NO July orders (anti join form)
SELECT c.customer_id, c.email
FROM analytics.customers c
LEFT JOIN analytics.orders o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01'
WHERE o.customer_id IS NULL;

Equivalent to NOT EXISTS; pick the style your team prefers.

SparkSQL Keywords - LEFT SEMI/LEFT ANTI

Spark exposes semi/anti joins directly—handy for readability and performance on big datasets.

-- Spark: customers with July orders (semi)
SELECT c.*
FROM analytics.customers c
LEFT SEMI JOIN analytics.orders o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= DATE '2025-07-01'
  AND o.order_ts <  DATE '2025-08-01';

-- Spark: customers without July orders (anti)
SELECT c.*
FROM analytics.customers c
LEFT ANTI JOIN analytics.orders o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= DATE '2025-07-01'
  AND o.order_ts <  DATE '2025-08-01';

CDC/new rows only with NOT EXISTS

Load only records that your target (Gold) doesn’t have yet, idempotent and incremental. A major factor for inline intraday reporting and event based loading.

-- New orders not yet in gold.fct_order
SELECT o.*
FROM analytics.orders o
WHERE NOT EXISTS (
  SELECT 1
  FROM gold.fct_order f
  WHERE f.order_id = o.order_id
);

Spice it up with a watermark for time-bounded loads:

AND o.order_ts > (SELECT COALESCE(last_order_ts, TIMESTAMP '1900-01-01') FROM ops.high_watermark)

Data Engineering Wisdom

Prefer EXISTS/NOT EXISTS for boolean membership tests—stable and optimizer-friendly.
For “without” logic, resist NOT IN unless you’ve eliminated NULLs in the subquery.
In outer joins, keep right-side filters in ON when you intend to preserve left rows (then use IS NULL for anti).
In CDC jobs, pair NOT EXISTS with a watermark for idempotent re-runs.

Mini-exercises (5–10 minutes)

July buyers (semi): Return customer_id of customers who placed at least one order in July 2025 using EXISTS.
- Expected: customers 1 and 2.
No July orders (anti): Return all customers with no July 2025 orders using NOT EXISTS.
- Expected: customers 3 and 4.
Ever ordered? (anti join form): Using a LEFT JOIN … IS NULL, list customers who have never placed any order.
- Expected: customer 4.
Latest per order_id (no windows): Using the self-anti pattern, keep only the latest row per order_id.
- Expected: with the seed set (one row per order), returns all orders unchanged.

Knowledge Check - Day 04 + Day 03 Answers

Answers to yesterdays quiz:

Q1: Choice 2

Q2: Choice 2

Tips for Different Engines

Postgres/DuckDB/SQL Server/SparkSQL: EXISTS/NOT EXISTS fully supported; optimizers often rewrite these to semi/anti joins automatically.
NOT IN caveat: If the subquery yields a NULL, the predicate becomes UNKNOWN → may return zero rows. Prefer NOT EXISTS.
SparkSQL: has LEFT SEMI and LEFT ANTI keywords (sugar for EXISTS/NOT EXISTS).
BigQuery/Snowflake: no SEMI/ANTI keywords, but EXISTS/NOT EXISTS and LEFT JOIN … IS NULL are optimised similarly.

What’s Next?

Day 05: Aggregations & HAVING

We’ll group rows, compute metrics, handle NULLs carefully, and learn when to filter with HAVING vs WHERE, plus production tips to avoid double counting in aggregated joins.

So see you for the next one and… Happy Coding!

Data Bytes & Insights

Discussion about this post