This 30-day series teaches SQL from a data-engineering angle: clear queries, production-safe patterns, and medallion-friendly habits. Apologies if you’ve already been through this, but I decided to retcon the previous series: its reliance on Microsoft Fabric overshadowed the content I was trying to provide around SQL.
Today you’ll: set up a free SQL environment, create tiny sample tables, and master explicit `SELECT`, clean `WHERE` filters, and half-open date ranges you’ll reuse all month.
TL;DR
- Avoid `SELECT *` in pipeline code: be explicit with columns.
- Filter time with half-open windows: `>= start AND < next_start`.
- Treat NULL and blank strings as missing, consistently.
- Prefer deterministic queries you can re-run and test.
Quick, free SQL setup (use this if you’ve nothing installed)
Already have Postgres / SQL Server / Spark? Stick with that. Otherwise, grab DuckDB:
- Python: `pip install duckdb`, then run `python -m duckdb` for a REPL.
- Or download the standalone CLI from duckdb.org (no admin rights needed).
Everything below runs as-is (ANSI-first); see the dialect notes at the end.
Let’s Create a Small Dataset
```sql
CREATE SCHEMA IF NOT EXISTS analytics;

CREATE OR REPLACE TABLE analytics.customers (
    customer_id  BIGINT PRIMARY KEY,
    email        VARCHAR,
    first_name   VARCHAR,
    last_name    VARCHAR,
    created_at   TIMESTAMP,
    country_code VARCHAR
);

CREATE OR REPLACE TABLE analytics.orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    status      VARCHAR,
    site        VARCHAR
);

INSERT INTO analytics.customers VALUES
    (1, 'amy@example.com', 'Amy', 'Tall', '2025-01-04 10:15:00', 'UK'),
    (2, ' ', 'Ben', 'Mann', '2025-02-12 08:00:00', 'UK'),      -- blank email
    (3, NULL, 'Hari', 'Seldon', '2025-03-05 12:30:00', 'ES'),
    (4, 'dave@example.com', 'Dave', 'smith', '2025-07-19 09:45:00', 'FR');

INSERT INTO analytics.orders VALUES
    (1001, 1, '2025-07-01 09:00:00', 'completed', 'dudcorp'),
    (1002, 1, '2025-07-15 14:12:00', 'cancelled', 'dudcorp'),
    (1003, 2, '2025-07-31 23:59:59', 'completed', 'dudcorp'),
    (1004, 3, '2025-08-01 00:00:00', 'completed', 'dudcorp');
```
Essential Query Patterns
Selecting specific columns from your table:

```sql
SELECT customer_id, email, created_at
FROM analytics.customers;
```
Precise filtering on specific values:

```sql
-- UK customers created in 2025
SELECT customer_id, email, created_at
FROM analytics.customers
WHERE country_code = 'UK'
  AND created_at >= TIMESTAMP '2025-01-01 00:00:00'
  AND created_at <  TIMESTAMP '2026-01-01 00:00:00';
```
Half-open monthly window:

```sql
-- Orders in July 2025 (inclusive of 1 July, exclusive of 1 Aug)
SELECT order_id, customer_id, order_ts, status
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01 00:00:00'
  AND order_ts <  TIMESTAMP '2025-08-01 00:00:00';
```
Half-open means the start is included and the end is excluded, written as [start, end). In SQL, use `ts >= start AND ts < next_boundary` for exact windows without off-by-one bugs.
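To see why this beats `BETWEEN`, remember that `BETWEEN` is inclusive at both ends, so the end bound has to guess a “last moment” of the window; with timestamp data that guess silently drops anything after it. A sketch against the sample orders table:

```sql
-- BETWEEN needs an exact end instant. An order at 2025-07-31 23:59:59.500
-- (sub-second precision) would slip through this filter:
SELECT order_id
FROM analytics.orders
WHERE order_ts BETWEEN TIMESTAMP '2025-07-01 00:00:00'
                   AND TIMESTAMP '2025-07-31 23:59:59';

-- The half-open form has no gap and composes cleanly month to month:
SELECT order_id
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01 00:00:00'
  AND order_ts <  TIMESTAMP '2025-08-01 00:00:00';
```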
Null-safe and blank-safe checks:

```sql
-- Customers with a real email (not NULL, not just spaces)
SELECT customer_id, email
FROM analytics.customers
WHERE email IS NOT NULL
  AND TRIM(email) <> '';
```
Set filters and negation:

```sql
SELECT order_id, status, site
FROM analytics.orders
WHERE status <> 'cancelled'
  AND site IN ('dudcorp');
```
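A related pitfall worth knowing: `<>` and `NOT IN` both drop rows where the column is NULL, because comparisons with NULL are never true. If rows with a NULL status should survive a negation filter, spell that out (a sketch; the sample data has no NULL statuses, so this is illustrative):

```sql
-- status <> 'cancelled' silently excludes rows whose status is NULL.
-- To keep them, make the NULL case explicit:
SELECT order_id, status, site
FROM analytics.orders
WHERE (status <> 'cancelled' OR status IS NULL)
  AND site IN ('dudcorp');
```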
Data Engineering Wisdom (habits to adopt)
- No `*` in pipelines. Version your column list like any other interface.
- Half-open time. Always use `[start, next_start)`; it’s partition- and test-friendly.
- Define “missing”. Decide whether blanks count as missing and enforce it early (Bronze).
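One way to “enforce early” is to normalise whitespace-only values to NULL as data lands in Bronze, so every downstream query relies on a single `IS NULL` check rather than re-doing the blank test. A sketch using the sample customers table:

```sql
-- Normalise blank emails once, on the way in:
-- NULLIF(TRIM(email), '') turns '' and '   ' into NULL.
SELECT
    customer_id,
    NULLIF(TRIM(email), '') AS email,
    first_name,
    last_name,
    created_at,
    country_code
FROM analytics.customers;
```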
Mini-exercises (5–10 minutes)
1. March UK sign-ups: return `customer_id, email, created_at` for UK customers created in March 2025 (half-open month range).
2. Completed July orders: `order_id, customer_id, order_ts` for `status = 'completed'` on site `dudcorp` in July 2025.
3. Missing emails: all customers where email is NULL or whitespace only.

(Expected with the data above: #1 → 0 rows, #2 → 2 rows (1001, 1003), #3 → 2 rows (ids 2 and 3).)
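If you want to check your approach, exercise 1 can be written with the same half-open pattern from earlier (one possible solution, not the only one):

```sql
-- Exercise 1: UK sign-ups in March 2025 (half-open month window).
SELECT customer_id, email, created_at
FROM analytics.customers
WHERE country_code = 'UK'
  AND created_at >= TIMESTAMP '2025-03-01 00:00:00'
  AND created_at <  TIMESTAMP '2025-04-01 00:00:00';
-- With the sample data this returns 0 rows: the two UK customers
-- were created in January and February.
```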
Tips for Different Engines
- Postgres: the syntax shown works as-is; `NULLIF(TRIM(email), '') IS NULL` is a tidy missing-email check.
- SQL Server: use DATETIME2 literals via `CAST('…' AS DATETIME2)`; `TRIM` requires SQL Server 2017+, otherwise use `LTRIM(RTRIM(email))`.
- Spark SQL: prefer filtering on partition columns (e.g., `order_date`) for partition pruning.
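Putting the SQL Server note into practice, the July window from earlier might look like this (a sketch; same half-open shape, only the literals change):

```sql
-- SQL Server flavour: DATETIME2 literals via CAST keeps the bounds explicit.
SELECT order_id, customer_id, order_ts, status
FROM analytics.orders
WHERE order_ts >= CAST('2025-07-01 00:00:00' AS DATETIME2)
  AND order_ts <  CAST('2025-08-01 00:00:00' AS DATETIME2);
```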
What’s next?
Day 02: Sorting & Paging (ORDER BY & LIMIT). We’ll cover making results deterministic with a stable `ORDER BY`, add tie-breakers, and learn when `OFFSET` hurts, plus safer paging patterns you can scale.
So see you for the next one and, as usual… Happy coding!