Day 05 of SQL for Data Engineers: Aggregations & HAVING

Part of Week 1: Query Essentials

Sep 05, 2025

With Day 5, we’re looking to turn lots of rows into answers. We’ll be covering how GROUP BY forms groups, how aggregate functions (like COUNT/SUM) work, especially with NULLs, and when to use HAVING vs WHERE. We’ll also cover safe patterns to avoid double-counting and to get zeroes where you expect them in outer joins.

TL;DR

GROUP BY makes groups; aggregates compute one row per group.
WHERE filters before grouping; HAVING filters after aggregation.
COUNT(col) ignores NULLs; COUNT(*) counts rows.
In LEFT JOINs, use COUNT(right.key) (not COUNT(*)) to get zero for “no match.”
For “completed vs cancelled” tallies, use conditional aggregation (e.g., SUM(CASE WHEN x THEN 1 END)).

Aggregation Patterns

Grouping Basics

GROUP BY collects rows by a key; each aggregate returns a single value per group.

-- July 2025 orders per customer
SELECT o.customer_id, COUNT(*) AS orders_in_july
FROM analytics.orders o
WHERE o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01'
GROUP BY o.customer_id
ORDER BY orders_in_july DESC, o.customer_id;

With the dummy data from day 01:

customer 1 → 2 orders (1001, 1002)
customer 2 → 1 order (1003)

WHERE narrows the rows first; then GROUP BY + aggregates to summarise.

COUNT(*) vs COUNT(column) - NULLs really do matter

Quick bullets to understand the two count variations:

COUNT(*) counts rows (ignores NULL status).
COUNT(column) counts non-NULL values in that column.

-- All customers grouped by country, counting rows
SELECT c.country_code, COUNT(*) AS customers_in_country
FROM analytics.customers c
GROUP BY c.country_code;

In a nutshell, Use COUNT(*) for row counts, COUNT(column) for present values (NULLs excluded).

Get Zeroes from a LEFT JOIN (null-aware counts)

When you LEFT JOIN and want “countries with zero July orders,” count a right-side column that is NULL for non-matches. If you used COUNT(*), you’d count the customer rows even when there are no orders.

-- July orders by customer country (0 when no orders)
SELECT
  c.country_code,
  COUNT(o.order_id) AS july_orders   -- counts only matched rows
FROM analytics.customers c
LEFT JOIN analytics.orders o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01'
GROUP BY c.country_code
ORDER BY july_orders DESC, c.country_code;

Conditional Aggregations - Totals by conditions

Compute multiple metrics in one pass with conditional sums.

-- Completed vs cancelled orders in July (overall)
SELECT
  COUNT(*) AS orders_total,
  SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS orders_completed,
  SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS orders_cancelled
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01'
  AND order_ts <  TIMESTAMP '2025-08-01';

Postgres/DuckDB also support COUNT(*) FILTER (WHERE status='completed'); use CASE in SQL Server/Spark.

The takeaway from this is essentially, conditional aggregation = fast, expressive dashboards in one query.

HAVING vs WHERE

Use WHERE for rows and HAVING for groups (e.g., “countries with ≥ 2 July orders”).

-- Countries with 2+ July orders
SELECT c.country_code, COUNT(o.order_id) AS july_orders
FROM analytics.customers c
LEFT JOIN analytics.orders o
  ON  o.customer_id = c.customer_id
  AND o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01'
GROUP BY c.country_code
HAVING COUNT(o.order_id) >= 2
ORDER BY july_orders DESC;

If your predicate references an aggregate (COUNT/SUM/AVG…), it belongs in HAVING not in a WHERE.

Using DISTINCT & Double Count Traps

Sometimes joins can duplicate left rows. What you should do to avoid this is use COUNT(DISTINCT key) when you mean unique entities, or pre-aggregate the many-side beforehand.

-- Distinct July buyers (unique customers with ≥1 order)
SELECT COUNT(DISTINCT o.customer_id) AS july_buyers
FROM analytics.orders o
WHERE o.order_ts >= TIMESTAMP '2025-07-01'
  AND o.order_ts <  TIMESTAMP '2025-08-01';

Data Engineering Wisdom

Name your groups clearly and keep column order predictable; downstream code often depends on position.
In LEFT JOIN summaries, count a right-side key to get zeroes for non-matches.
Use conditional aggregation for multi-metric reports; avoid multiple scans.
Add row-count assertions to tests (e.g., “distinct buyers this month ≥ last month”).
Watch COUNT(DISTINCT …) on very large data—it can be expensive; pre-aggregate or use approximate functions where available.

Mini-exercises (5–10 minutes)

Orders per customer (July): List customer_id, orders_in_july and keep only customers with ≥ 1 order.
- Hint: WHERE to limit to July, GROUP BY customer_id, HAVING COUNT(*) >= 1.
- Expected: customer 1 → 2, customer 2 → 1.
Country-level completions: For each country, return completed_july_orders using conditional aggregation on a LEFT JOIN with the date filter in ON.
- Expected: UK = 2, ES = 0, FR = 0.

Knowledge Check - Day 05 + Day 04 Answers

Answers to yesterdays quiz:

Q1: Choice 2
Q2: Choice 2
Q3: Choice 2

Tips for Different Engines

Postgres/DuckDB: support FILTER (WHERE …) on aggregates; also GROUPING SETS/ROLLUP if you want subtotals.
SQL Server: use SUM(CASE WHEN … THEN 1 ELSE 0 END) for conditional counts; ROLLUP/CUBE available.
SparkSQL: use conditional aggregation with CASE WHEN; ROLLUP supported via GROUPING SETS.

What’s Next?

Day 06: CASE Expressions & Business Rules

We’ll cover how to turn messy reality into tidy columns: bucket dates, build flags, replace magic numbers with lookup tables, and keep rules consistent across Bronze → Silver → Gold. A good solid introduction to working with the medallion architecture from a SQL standpoint.

See you for the next one, and happy coding!

Data Bytes & Insights

Discussion about this post