This 30-day series teaches SQL from a data-engineering angle: clear queries, production-safe patterns, and medallion-friendly habits. Apologies if you’ve already been through this, but I decided to retcon the previous series: its reliance on Microsoft Fabric overshadowed the content I was trying to provide around SQL.
Today you’ll: set up a free SQL environment, create tiny sample tables, and master explicit `SELECT`, clean `WHERE` filters, and half-open date ranges you’ll reuse all month.
TL;DR
- Avoid `SELECT *` in pipeline code: be explicit with columns.
- Filter time with half-open windows: `>= start AND < next_start`.
- Treat NULL and blank strings as missing, consistently.
- Prefer deterministic queries you can re-run and test.
Quick, free SQL setup (use this if you’ve nothing installed)
Already have Postgres / SQL Server / Spark? Stick with that. Otherwise, grab DuckDB:
- Python: `pip install duckdb`, then run `python -m duckdb` for a REPL.
- Or download the standalone CLI from duckdb.org (no admin rights needed).
Everything below runs as-is (ANSI-first); see the dialect notes at the end.
Let’s Create a Small Dataset
```sql
CREATE SCHEMA IF NOT EXISTS analytics;

CREATE OR REPLACE TABLE analytics.customers (
    customer_id  BIGINT PRIMARY KEY,
    email        VARCHAR,
    first_name   VARCHAR,
    last_name    VARCHAR,
    created_at   TIMESTAMP,
    country_code VARCHAR
);

CREATE OR REPLACE TABLE analytics.orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    status      VARCHAR,
    site        VARCHAR
);

INSERT INTO analytics.customers VALUES
    (1, 'amy@example.com', 'Amy', 'Tall', '2025-01-04 10:15:00', 'UK'),
    (2, ' ', 'Ben', 'Mann', '2025-02-12 08:00:00', 'UK'),      -- blank email
    (3, NULL, 'Hari', 'Seldon', '2025-03-05 12:30:00', 'ES'),
    (4, 'dave@example.com', 'Dave', 'smith', '2025-07-19 09:45:00', 'FR');

INSERT INTO analytics.orders VALUES
    (1001, 1, '2025-07-01 09:00:00', 'completed', 'dudcorp'),
    (1002, 1, '2025-07-15 14:12:00', 'cancelled', 'dudcorp'),
    (1003, 2, '2025-07-31 23:59:59', 'completed', 'dudcorp'),
    (1004, 3, '2025-08-01 00:00:00', 'completed', 'dudcorp');
```
Essential Query Patterns
Selecting specific columns from your table:

```sql
SELECT customer_id, email, created_at
FROM analytics.customers;
```
Precise filtering on specific values:

```sql
-- UK customers created in 2025
SELECT customer_id, email, created_at
FROM analytics.customers
WHERE country_code = 'UK'
  AND created_at >= TIMESTAMP '2025-01-01 00:00:00'
  AND created_at <  TIMESTAMP '2026-01-01 00:00:00';
```
Half-open monthly window:

```sql
-- Orders in July 2025 (inclusive of 1 July, exclusive of 1 Aug)
SELECT order_id, customer_id, order_ts, status
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01 00:00:00'
  AND order_ts <  TIMESTAMP '2025-08-01 00:00:00';
```
Half-open means the start is included and the end is excluded, written as [start, end). In SQL, use `ts >= start AND ts < next_boundary` for exact windows without off-by-one bugs.
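To see why this beats `BETWEEN`, remember that `BETWEEN` is inclusive at both ends, so the end bound has to guess a “last moment” of the window; with timestamp data that guess silently drops anything after it. A sketch against the sample orders table:

```sql
-- BETWEEN needs an exact end instant. An order at 2025-07-31 23:59:59.500
-- (sub-second precision) would slip through this filter:
SELECT order_id
FROM analytics.orders
WHERE order_ts BETWEEN TIMESTAMP '2025-07-01 00:00:00'
                   AND TIMESTAMP '2025-07-31 23:59:59';

-- The half-open form has no gap and composes cleanly month to month:
SELECT order_id
FROM analytics.orders
WHERE order_ts >= TIMESTAMP '2025-07-01 00:00:00'
  AND order_ts <  TIMESTAMP '2025-08-01 00:00:00';
```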
Null-safe and blank-safe checks:

```sql
-- Customers with a real email (not NULL, not just spaces)
SELECT customer_id, email
FROM analytics.customers
WHERE email IS NOT NULL
  AND TRIM(email) <> '';
```
Set filters and negation:

```sql
SELECT order_id, status, site
FROM analytics.orders
WHERE status <> 'cancelled'
  AND site IN ('dudcorp');
```
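A related pitfall worth knowing: `<>` and `NOT IN` both drop rows where the column is NULL, because comparisons with NULL are never true. If rows with a NULL status should survive a negation filter, spell that out (a sketch; the sample data has no NULL statuses, so this is illustrative):

```sql
-- status <> 'cancelled' silently excludes rows whose status is NULL.
-- To keep them, make the NULL case explicit:
SELECT order_id, status, site
FROM analytics.orders
WHERE (status <> 'cancelled' OR status IS NULL)
  AND site IN ('dudcorp');
```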
Data Engineering Wisdom (habits to adopt)
- No `*` in pipelines. Version your column list like any other interface.
- Half-open time. Always use `[start, next_start)`; it’s partition- and test-friendly.
- Define “missing”. Decide whether blanks count as missing and enforce it early (Bronze).
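One way to “enforce early” is to normalise whitespace-only values to NULL as data lands in Bronze, so every downstream query relies on a single `IS NULL` check rather than re-doing the blank test. A sketch using the sample customers table:

```sql
-- Normalise blank emails once, on the way in:
-- NULLIF(TRIM(email), '') turns '' and '   ' into NULL.
SELECT
    customer_id,
    NULLIF(TRIM(email), '') AS email,
    first_name,
    last_name,
    created_at,
    country_code
FROM analytics.customers;
```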
Mini-exercises (5–10 minutes)
1. March UK sign-ups: return `customer_id, email, created_at` for UK customers created in March 2025 (half-open month range).
2. Completed July orders: `order_id, customer_id, order_ts` for `status = 'completed'` on site `dudcorp` in July 2025.
3. Missing emails: all customers where email is NULL or whitespace only.

(Expected with the data above: #1 → 0 rows, #2 → 2 rows (1001, 1003), #3 → 2 rows (ids 2 and 3).)
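If you want to check your approach, exercise 1 can be written with the same half-open pattern from earlier (one possible solution, not the only one):

```sql
-- Exercise 1: UK sign-ups in March 2025 (half-open month window).
SELECT customer_id, email, created_at
FROM analytics.customers
WHERE country_code = 'UK'
  AND created_at >= TIMESTAMP '2025-03-01 00:00:00'
  AND created_at <  TIMESTAMP '2025-04-01 00:00:00';
-- With the sample data this returns 0 rows: the two UK customers
-- were created in January and February.
```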
Tips for Different Engines
- Postgres: the syntax shown works as-is; `NULLIF(TRIM(email), '') IS NULL` is a tidy missing-email check.
- SQL Server: use DATETIME2 literals via `CAST('…' AS DATETIME2)`; `TRIM` requires SQL Server 2017+, otherwise use `LTRIM(RTRIM(email))`.
- Spark SQL: prefer filtering on partition columns (e.g., `order_date`) for partition pruning.
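Putting the SQL Server note into practice, the July window from earlier might look like this (a sketch; same half-open shape, only the literals change):

```sql
-- SQL Server flavour: DATETIME2 literals via CAST keeps the bounds explicit.
SELECT order_id, customer_id, order_ts, status
FROM analytics.orders
WHERE order_ts >= CAST('2025-07-01 00:00:00' AS DATETIME2)
  AND order_ts <  CAST('2025-08-01 00:00:00' AS DATETIME2);
```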
What’s next?
Day 02: Sorting & Paging (ORDER BY & LIMIT). We’ll cover making results deterministic with a stable `ORDER BY`, add tie-breakers, and learn when `OFFSET` hurts, plus safer paging patterns you can scale.
So see you for the next one and, as usual… Happy coding!