Day 05 of SQL for Data Engineers: Aggregations & HAVING
Part of Week 1: Query Essentials
With Day 5, we’re looking to turn lots of rows into answers. We’ll be covering how GROUP BY forms groups, how aggregate functions (like COUNT/SUM) work, especially with NULLs, and when to use HAVING vs WHERE. We’ll also cover safe patterns to avoid double-counting and to get zeroes where you expect them in outer joins.
TL;DR
GROUP BYmakes groups; aggregates compute one row per group.WHEREfilters before grouping;HAVINGfilters after aggregation.COUNT(col)ignores NULLs;COUNT(*)counts rows.In LEFT JOINs, use
COUNT(right.key)(notCOUNT(*)) to get zero for “no match.”For “completed vs cancelled” tallies, use conditional aggregation (e.g.,
SUM(CASE WHENxTHEN 1 END)).
Aggregation Patterns
Grouping Basics
GROUP BY collects rows by a key; each aggregate returns a single value per group.
-- July 2025 orders per customer
SELECT o.customer_id, COUNT(*) AS orders_in_july
FROM analytics.orders o
WHERE o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'
GROUP BY o.customer_id
ORDER BY orders_in_july DESC, o.customer_id;With the dummy data from day 01:
customer 1 → 2 orders (1001, 1002)
customer 2 → 1 order (1003)
WHERE narrows the rows first; then GROUP BY + aggregates to summarise.
COUNT(*) vs COUNT(column) - NULLs really do matter
Quick bullets to understand the two count variations:
COUNT(*)counts rows (ignores NULL status).COUNT(column)counts non-NULL values in that column.
-- All customers grouped by country, counting rows
SELECT c.country_code, COUNT(*) AS customers_in_country
FROM analytics.customers c
GROUP BY c.country_code;In a nutshell, Use COUNT(*) for row counts, COUNT(column) for present values (NULLs excluded).
Get Zeroes from a LEFT JOIN (null-aware counts)
When you LEFT JOIN and want “countries with zero July orders,” count a right-side column that is NULL for non-matches. If you used COUNT(*), you’d count the customer rows even when there are no orders.
-- July orders by customer country (0 when no orders)
SELECT
c.country_code,
COUNT(o.order_id) AS july_orders -- counts only matched rows
FROM analytics.customers c
LEFT JOIN analytics.orders o
ON o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'
GROUP BY c.country_code
ORDER BY july_orders DESC, c.country_code;Conditional Aggregations - Totals by conditions
Compute multiple metrics in one pass with conditional sums.
-- Completed vs cancelled orders in July (overall)
SELECT
COUNT(*) AS orders_total,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS orders_completed,
SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS orders_cancelled
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01'
AND order_ts < TIMESTAMP '2025-08-01';Postgres/DuckDB also support
COUNT(*) FILTER (WHERE status='completed'); useCASEin SQL Server/Spark.
The takeaway from this is essentially, conditional aggregation = fast, expressive dashboards in one query.
HAVING vs WHERE
Use WHERE for rows and HAVING for groups (e.g., “countries with ≥ 2 July orders”).
-- Countries with 2+ July orders
SELECT c.country_code, COUNT(o.order_id) AS july_orders
FROM analytics.customers c
LEFT JOIN analytics.orders o
ON o.customer_id = c.customer_id
AND o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01'
GROUP BY c.country_code
HAVING COUNT(o.order_id) >= 2
ORDER BY july_orders DESC;If your predicate references an aggregate (COUNT/SUM/AVG…), it belongs in HAVING not in a WHERE.
Using DISTINCT & Double Count Traps
Sometimes joins can duplicate left rows. What you should do to avoid this is use COUNT(DISTINCT key) when you mean unique entities, or pre-aggregate the many-side beforehand.
-- Distinct July buyers (unique customers with ≥1 order)
SELECT COUNT(DISTINCT o.customer_id) AS july_buyers
FROM analytics.orders o
WHERE o.order_ts >= TIMESTAMP '2025-07-01'
AND o.order_ts < TIMESTAMP '2025-08-01';Data Engineering Wisdom
Name your groups clearly and keep column order predictable; downstream code often depends on position.
In LEFT JOIN summaries, count a right-side key to get zeroes for non-matches.
Use conditional aggregation for multi-metric reports; avoid multiple scans.
Add row-count assertions to tests (e.g., “distinct buyers this month ≥ last month”).
Watch
COUNT(DISTINCT …)on very large data—it can be expensive; pre-aggregate or use approximate functions where available.
Mini-exercises (5–10 minutes)
Orders per customer (July): List
customer_id, orders_in_julyand keep only customers with ≥ 1 order.Hint:
WHEREto limit to July,GROUP BY customer_id,HAVING COUNT(*) >= 1.Expected: customer 1 → 2, customer 2 → 1.
Country-level completions: For each country, return
completed_july_ordersusing conditional aggregation on a LEFT JOIN with the date filter inON.Expected: UK = 2, ES = 0, FR = 0.
Knowledge Check - Day 05 + Day 04 Answers
Answers to yesterdays quiz:
Q1: Choice 2
Q2: Choice 2
Q3: Choice 2
Tips for Different Engines
Postgres/DuckDB: support
FILTER (WHERE …)on aggregates; alsoGROUPING SETS/ROLLUPif you want subtotals.SQL Server: use
SUM(CASE WHEN … THEN 1 ELSE 0 END)for conditional counts;ROLLUP/CUBEavailable.SparkSQL: use conditional aggregation with
CASE WHEN;ROLLUPsupported viaGROUPING SETS.
What’s Next?
Day 06: CASE Expressions & Business Rules
We’ll cover how to turn messy reality into tidy columns: bucket dates, build flags, replace magic numbers with lookup tables, and keep rules consistent across Bronze → Silver → Gold. A good solid introduction to working with the medallion architecture from a SQL standpoint.
See you for the next one, and happy coding!


