Day 03 of SQL for Data Engineers: JOINS 101

Part of Week 1: Query Essentials

Sep 03, 2025

Today we’ll learn how to combine tables using JOIN. You’ll see when to use INNER, LEFT, RIGHT, and FULL joins, how nulls expand in outer joins, where to put filters (ON vs WHERE), and how to avoid cartesian explosions and double counting.

TL;DR

INNER JOIN keeps only matching rows in both tables.
LEFT JOIN keeps all left rows; unmatched right columns become NULL.
RIGHT JOIN is the mirror of LEFT (less commonly used).
FULL OUTER JOIN returns matches + non-matches from both sides.
Put filters on the joined (right) table in the ON clause if you want to preserve unmatched rows in a LEFT/FULL join.
Watch 1-to-many joins → use COUNT(DISTINCT x) or pre-aggregate to avoid double counts.

JOIN Patterns

INNER JOIN - Keep only matches

Use when the row must exist in both tables. Great for facts with guaranteed dimensions.

-- Orders in July 2025 with customer country
SELECT o.order_id, o.order_ts, c.country_code
FROM analytics.orders  AS o
INNER JOIN analytics.customers AS c
  ON c.customer_id = o.customer_id
WHERE o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01';

If a customer has no order (or an order has no customer), that row is dropped.

LEFT JOIN - Keep all left rows (null expansion)

Use when you want every row from the left table, even if there’s no match on the right. Unmatched right columns become NULL (this is called null expansion).

-- All customers, with their July 2025 order_id if any
SELECT c.customer_id, c.email, o.order_id
FROM analytics.customers AS c
LEFT JOIN analytics.orders AS o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01';  -- to preserve non-order customers

Why filter inside the ON?

If you move those date filters into WHERE, you’ll remove the NULL-extended rows and accidentally turn the LEFT JOIN into an INNER JOIN.

-- Common mistake: filters in WHERE undo the LEFT JOIN
SELECT c.customer_id, o.order_id
FROM analytics.customers c
LEFT JOIN analytics.orders o
  ON o.customer_id = c.customer_id
WHERE o.order_ts >= TIMESTAMP '2025-07-01'   -- this kills the nulls
  AND o.order_ts <  TIMESTAMP '2025-08-01';

For LEFT/FULL joins, put filters on the right table inside ON to keep unmatched left rows.

RIGHT JOIN - Mirror image of LEFT

Keeps every row from the right table; unmatched left columns are NULL. It’s equivalent to swapping table order in a LEFT JOIN.

-- All July orders with customer country if present
SELECT o.order_id, c.country_code
FROM analytics.customers AS c
RIGHT JOIN analytics.orders AS o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01';

FULL OUTER JOIN - See everything (matches + non-matches)

Use for comparisons and audits: returns matching pairs plus non-matches on both sides. Helpful to find orphans and count diffs.

-- Compare customers vs July orders: who matched, who didn't
SELECT
  COALESCE(c.customer_id, o.customer_id) AS customer_id,
  CASE
    WHEN c.customer_id IS NOT NULL AND o.customer_id IS NOT NULL THEN 'match'
    WHEN c.customer_id IS NOT NULL AND o.customer_id IS NULL  THEN 'customer_only'
    WHEN c.customer_id IS NULL  AND o.customer_id IS NOT NULL THEN 'order_only'
  END AS match_status
FROM analytics.customers c
FULL OUTER JOIN (
  SELECT DISTINCT customer_id
  FROM analytics.orders
  WHERE order_ts >= TIMESTAMP '2025-07-01'
    AND order_ts <  TIMESTAMP '2025-08-01'
) o
  ON o.customer_id = c.customer_id;

Avoid cartesian products (accidental cross joins)

If you forget the join condition (or it’s wrong), you multiply rows (N×M explosion). This tanks performance and corrupts counts.

-- Missing/incorrect ON condition → explosion
SELECT *
FROM analytics.customers c
JOIN analytics.orders o;  -- no ON! Cross-joins every customer to every order

Recommended Guardrails

Always specify key = key in ON.
Sanity-check row counts after joins (expectation vs result).
For 1-to-many joins, pre-aggregate the “many” side if you only need one row per key.

Dealing with Duplicates within a JOIN

Joining orders (1) to order_items (many) duplicates order rows. Use COUNT(DISTINCT x) or aggregate first.

-- Pre-aggregate items, then join → one row per order
WITH item_amt AS (
  SELECT order_id, SUM(quantity * unit_price_dec) AS gross_amount
  FROM analytics.order_items
  GROUP BY order_id
)
SELECT o.order_id, ia.gross_amount
FROM analytics.orders o
LEFT JOIN item_amt ia
  ON ia.order_id = o.order_id;

Data Engineering Wisdom

Define and document join keys (natural vs surrogate) in your team style guide.
For LEFT/FULL joins, put right-side filters in ON, not WHERE.
Add row-count assertions after joins in tests (e.g., “no more than 1 row per order”).

Mini-exercises (5–10 minutes)

Inner join basics: List order_id, email for all July 2025 orders using an INNER JOIN to customers.
- Expected: orders 1001, 1002, 1003 with emails from customers 1 & 2.
Left join non-buyers: Using a LEFT JOIN, return all customers with a flag has_july_order (Y/N). Put the date filter in the ON clause.
- Expected: customers 1–4; only 1 and 2 should be Y.
Find orphans (FULL/LEFT anti): Return customers who have no orders at all (any month). Use LEFT JOIN + WHERE o.customer_id IS NULL or FULL OUTER JOIN with a match_status.
- Expected (with seed data): customer 4 has orders? (No July orders; check other months in your data.) In the Day-1 seed only July/Aug are present, so likely customer 4 is an orphan.
Double-count trap: Join orders to order_items (if you created it). Show how COUNT(*) vs COUNT(DISTINCT o.order_id) differ. Then fix by pre-aggregating.

Knowledge Check - Day 03 + Day 02 Answers

Answers to yesterdays quiz:

Q1: Choice 2
Q2: Choice 2
Q3: Choice 1

Tips for Different Engines

Postgres/DuckDB/SQL Server/SparkSQL: support INNER, LEFT, RIGHT; FULL OUTER is supported in these engines.
If your engine lacks FULL OUTER (e.g., some MySQL setups), emulate with LEFT JOIN … WHERE right.key IS NULL UNION RIGHT JOIN … WHERE left.key IS NULL, or with a UNION of two anti-joins.
USING vs ON: USING(key) is shorthand when both tables share the same column name; ON is more explicit and flexible.

What’s Next?

Day 04: JOINS 201

We’ll be going over extended coverage of JOINS: EXISTS/NOT EXISTS, LEFT ANTI, and LEFT SEMI patterns to answer “does it exist?” without duplicating rows perfect for best practice de-duping, “customers without orders,” and change-data filtering.

See you for joins part 2, until then… Happy Coding!

Data Bytes & Insights

Discussion about this post