Day 03 of SQL for Data Engineers: JOINS 101
Part of Week 1: Query Essentials
Today we’ll learn how to combine tables using JOIN. You’ll see when to use INNER, LEFT, RIGHT, and FULL joins, how nulls expand in outer joins, where to put filters (ON vs WHERE), and how to avoid cartesian explosions and double counting.
TL;DR
INNER JOIN keeps only matching rows in both tables.
LEFT JOIN keeps all left rows; unmatched right columns become NULL.
RIGHT JOIN is the mirror of LEFT (less commonly used).
FULL OUTER JOIN returns matches + non-matches from both sides.
Put filters on the joined (right) table in the
ONclause if you want to preserve unmatched rows in a LEFT/FULL join.Watch 1-to-many joins → use
COUNT(DISTINCTx)or pre-aggregate to avoid double counts.
JOIN Patterns
INNER JOIN - Keep only matches
Use when the row must exist in both tables. Great for facts with guaranteed dimensions.
-- Orders in July 2025 with customer country
SELECT o.order_id, o.order_ts, c.country_code
FROM analytics.orders AS o
INNER JOIN analytics.customers AS c
ON c.customer_id = o.customer_id
WHERE o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01';If a customer has no order (or an order has no customer), that row is dropped.
LEFT JOIN - Keep all left rows (null expansion)
Use when you want every row from the left table, even if there’s no match on the right. Unmatched right columns become NULL (this is called null expansion).
-- All customers, with their July 2025 order_id if any
SELECT c.customer_id, c.email, o.order_id
FROM analytics.customers AS c
LEFT JOIN analytics.orders AS o
ON o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'; -- to preserve non-order customersWhy filter inside the ON?
If you move those date filters into WHERE, you’ll remove the NULL-extended rows and accidentally turn the LEFT JOIN into an INNER JOIN.
-- Common mistake: filters in WHERE undo the LEFT JOIN
SELECT c.customer_id, o.order_id
FROM analytics.customers c
LEFT JOIN analytics.orders o
ON o.customer_id = c.customer_id
WHERE o.order_ts >= TIMESTAMP '2025-07-01' -- this kills the nulls
AND o.order_ts < TIMESTAMP '2025-08-01';For LEFT/FULL joins, put filters on the right table inside ON to keep unmatched left rows.
RIGHT JOIN - Mirror image of LEFT
Keeps every row from the right table; unmatched left columns are NULL. It’s equivalent to swapping table order in a LEFT JOIN.
-- All July orders with customer country if present
SELECT o.order_id, c.country_code
FROM analytics.customers AS c
RIGHT JOIN analytics.orders AS o
ON o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01';FULL OUTER JOIN - See everything (matches + non-matches)
Use for comparisons and audits: returns matching pairs plus non-matches on both sides. Helpful to find orphans and count diffs.
-- Compare customers vs July orders: who matched, who didn't
SELECT
COALESCE(c.customer_id, o.customer_id) AS customer_id,
CASE
WHEN c.customer_id IS NOT NULL AND o.customer_id IS NOT NULL THEN 'match'
WHEN c.customer_id IS NOT NULL AND o.customer_id IS NULL THEN 'customer_only'
WHEN c.customer_id IS NULL AND o.customer_id IS NOT NULL THEN 'order_only'
END AS match_status
FROM analytics.customers c
FULL OUTER JOIN (
SELECT DISTINCT customer_id
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01'
AND order_ts < TIMESTAMP '2025-08-01'
) o
ON o.customer_id = c.customer_id;Avoid cartesian products (accidental cross joins)
If you forget the join condition (or it’s wrong), you multiply rows (N×M explosion). This tanks performance and corrupts counts.
-- Missing/incorrect ON condition → explosion
SELECT *
FROM analytics.customers c
JOIN analytics.orders o; -- no ON! Cross-joins every customer to every order
Recommended Guardrails
Always specify key = key in
ON.Sanity-check row counts after joins (expectation vs result).
For 1-to-many joins, pre-aggregate the “many” side if you only need one row per key.
Dealing with Duplicates within a JOIN
Joining orders (1) to order_items (many) duplicates order rows. Use COUNT(DISTINCT x) or aggregate first.
-- Pre-aggregate items, then join → one row per order
WITH item_amt AS (
SELECT order_id, SUM(quantity * unit_price_dec) AS gross_amount
FROM analytics.order_items
GROUP BY order_id
)
SELECT o.order_id, ia.gross_amount
FROM analytics.orders o
LEFT JOIN item_amt ia
ON ia.order_id = o.order_id;Data Engineering Wisdom
Define and document join keys (natural vs surrogate) in your team style guide.
For LEFT/FULL joins, put right-side filters in
ON, notWHERE.Add row-count assertions after joins in tests (e.g., “no more than 1 row per order”).
Mini-exercises (5–10 minutes)
Inner join basics: List
order_id, emailfor all July 2025 orders using an INNER JOIN to customers.Expected: orders
1001, 1002, 1003with emails from customers 1 & 2.
Left join non-buyers: Using a LEFT JOIN, return all customers with a flag
has_july_order(Y/N). Put the date filter in the ON clause.Expected: customers 1–4; only 1 and 2 should be
Y.
Find orphans (FULL/LEFT anti): Return customers who have no orders at all (any month). Use LEFT JOIN +
WHERE o.customer_id IS NULLor FULL OUTER JOIN with amatch_status.Expected (with seed data): customer
4has orders? (No July orders; check other months in your data.) In the Day-1 seed only July/Aug are present, so likely customer 4 is an orphan.
Double-count trap: Join
orderstoorder_items(if you created it). Show howCOUNT(*)vsCOUNT(DISTINCT o.order_id)differ. Then fix by pre-aggregating.
Knowledge Check - Day 03 + Day 02 Answers
Answers to yesterdays quiz:
Q1: Choice 2
Q2: Choice 2
Q3: Choice 1
Tips for Different Engines
Postgres/DuckDB/SQL Server/SparkSQL: support INNER, LEFT, RIGHT; FULL OUTER is supported in these engines.
If your engine lacks FULL OUTER (e.g., some MySQL setups), emulate with
LEFT JOIN … WHERE right.key IS NULLUNIONRIGHT JOIN … WHERE left.key IS NULL, or with aUNIONof two anti-joins.USINGvsON:USING(key)is shorthand when both tables share the same column name;ONis more explicit and flexible.
What’s Next?
Day 04: JOINS 201
We’ll be going over extended coverage of JOINS: EXISTS/NOT EXISTS, LEFT ANTI, and LEFT SEMI patterns to answer “does it exist?” without duplicating rows perfect for best practice de-duping, “customers without orders,” and change-data filtering.
See you for joins part 2, until then… Happy Coding!


