Day 08 of SQL for Data Engineers: Dates, Time Zones & Watermarks
Part of Week 2: Physical Design & Storage
Today we’ll kick off week 2 of the series by making time your friend. You’ll learn date math, safe time-window filters, how to store in UTC and display in local time, and how to run incremental loads with a high-watermark so your pipelines are fast and idempotent.
TL;DR
- Use half-open windows: ts >= start AND ts < next_start.
- Store timestamps in UTC; convert only for display.
- Derive date/month columns for partitioning (order_date, order_month).
- Watermarks let you load “only new/changed rows” safely and repeatably.
The Patterns
Derive date & month once (for grouping/partitioning)
Don’t keep calling functions in every query. Persist helpful columns for grouping and partition pruning.
-- Create helpful derived columns (as a view or materialised column)
SELECT
order_id,
order_ts, -- UTC timestamp
CAST(order_ts AS DATE) AS order_date, -- day grain
DATE_TRUNC('month', order_ts) AS order_month -- month grain
FROM analytics.orders;

Half-open time windows (no off-by-one bugs)
The beginner-proof pattern for time ranges is inclusive start, exclusive end.
-- All of July 2025
WHERE order_ts >= TIMESTAMP '2025-07-01 00:00:00'
AND order_ts < TIMESTAMP '2025-08-01 00:00:00'

This works for hours, days, months… just pick the next boundary.
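For example, a single day uses exactly the same shape (the date here is just illustrative):
-- All of 15 July 2025: same pattern, the next boundary is the following midnight
WHERE order_ts >= TIMESTAMP '2025-07-15 00:00:00'
AND order_ts < TIMESTAMP '2025-07-16 00:00:00'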
UTC in storage, local for display
Pipelines should not depend on local time or DST changes. Keep storage in UTC; convert only for local display.
-- Show London time for a UTC column (Postgres / DuckDB style)
SELECT
order_id,
order_ts, -- stored UTC
order_ts AT TIME ZONE 'Europe/London' AS order_ts_london -- display in local time
FROM analytics.orders;
-- SQL Server
SELECT
order_id,
order_ts, -- DATETIME2 assumed UTC
order_ts AT TIME ZONE 'UTC' AT TIME ZONE 'GMT Standard Time' AS order_ts_london
FROM analytics.orders;
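Spark SQL has no AT TIME ZONE; as the engine tips below note, from_utc_timestamp is the usual equivalent (a sketch assuming order_ts is stored in UTC):
-- Spark SQL: interpret the stored UTC value and render it in London local time
SELECT
order_id,
order_ts, -- stored UTC
from_utc_timestamp(order_ts, 'Europe/London') AS order_ts_london
FROM analytics.orders;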
Date math & truncation (portable patterns)
Common questions: “start of month,” “days between,” “last day of month,” “rolling N days.”
-- Start of the month containing order_ts
SELECT DATE_TRUNC('month', order_ts) AS month_start FROM analytics.orders;
-- Days since customer created (Postgres/DuckDB)
SELECT customer_id, (CURRENT_DATE - CAST(created_at AS DATE)) AS days_since_signup
FROM analytics.customers;
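The “last day of month” case from the list above isn’t shown; one common way in Postgres/DuckDB is interval arithmetic (SQL Server has EOMONTH(), sketched later):
-- Last day of the month containing order_ts (Postgres/DuckDB)
SELECT CAST(DATE_TRUNC('month', order_ts) + INTERVAL '1 month' - INTERVAL '1 day' AS DATE) AS month_end
FROM analytics.orders;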
-- Rolling 7 days window (half-open)
WHERE order_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days'
AND order_ts < CURRENT_TIMESTAMP

High-watermark loads (incremental & idempotent)
A watermark records the last successfully processed timestamp. Each run loads >= last_watermark AND < new_watermark. If the job reruns, results stay the same with no duplicates.
-- 1) Watermark table (one row per feed)
CREATE SCHEMA IF NOT EXISTS ops;
CREATE TABLE IF NOT EXISTS ops.high_watermark (
name VARCHAR PRIMARY KEY,
last_ts TIMESTAMP
);
-- Seed (run once, per feed)
INSERT INTO ops.high_watermark(name, last_ts)
VALUES ('orders', TIMESTAMP '1900-01-01')
ON CONFLICT (name) DO NOTHING; -- Postgres/DuckDB; use MERGE/IF NOT EXISTS in others
-- 2) Pick a new watermark boundary (e.g., max seen or 'now' rounded)
WITH bounds AS (
SELECT
(SELECT last_ts FROM ops.high_watermark WHERE name='orders') AS last_wm,
DATE_TRUNC('minute', CURRENT_TIMESTAMP) AS new_wm
),
src AS (
SELECT *
FROM analytics.orders, bounds
WHERE order_ts >= bounds.last_wm
AND order_ts < bounds.new_wm
)
-- 3) Load src into your target (staging/merge) here…
-- 4) Advance the watermark AFTER a successful load
--    (bounds is only visible inside this statement; if you run step 4 separately,
--     carry the same new_wm value over yourself)
UPDATE ops.high_watermark
SET last_ts = (SELECT new_wm FROM bounds)
WHERE name = 'orders';

Data Engineering Wisdom
- Persist order_date/order_month; partition facts on one of them.
- Never hard-code “end of day” as 23:59:59.999; always use half-open windows.
- Keep UTC in storage; add display views for business time zones.
- Record min_ts/max_ts/row_count per batch in an audit table with the watermark you used.
- For high-volume feeds, overlap the last hour/day and MERGE by key to tolerate late arrivals (see the sketch after this list).
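A minimal sketch of that last point, assuming a hypothetical staging table stg.orders_batch and Postgres 15+ style MERGE syntax (adapt the syntax and column list to your engine):
-- Re-read with a 1-day overlap, then upsert by business key so late or duplicate rows are absorbed
MERGE INTO analytics.orders AS t
USING (
SELECT *
FROM stg.orders_batch -- hypothetical staging table holding the latest extract
WHERE order_ts >= (SELECT last_ts FROM ops.high_watermark WHERE name = 'orders') - INTERVAL '1 day'
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET order_ts = s.order_ts -- …plus the other columns you track
WHEN NOT MATCHED THEN
INSERT (order_id, order_ts) VALUES (s.order_id, s.order_ts);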
Mini-exercises (5–10 minutes)
1. Daily order counts (July 2025): Return order_date, orders_count for all July days (include days with 0 by generating the date series if your engine supports it; see the hint after these exercises). Expected (seed data): counts on 2025-07-01, 2025-07-15, 2025-07-31 only.
2. Localised display: Select order_id, order_ts and a London local column. Check: 2025-07-01 09:00 UTC should display as 10:00 London (BST) in July.
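Hint for exercise 1: engines like Postgres and DuckDB can generate the missing days with generate_series, which you then LEFT JOIN your counts onto (a sketch; the table alias is illustrative):
-- Every day in July 2025 (Postgres/DuckDB generate_series)
SELECT CAST(d AS DATE) AS order_date
FROM generate_series(TIMESTAMP '2025-07-01', TIMESTAMP '2025-07-31', INTERVAL '1 day') AS t(d);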
Tips for Different Engines
- Postgres/DuckDB/Spark SQL: DATE_TRUNC('month', ts), CAST(ts AS DATE), intervals like INTERVAL '7 days', and AT TIME ZONE 'Europe/London'. Spark uses from_utc_timestamp(ts, 'Europe/London') / to_utc_timestamp(ts, 'Europe/London').
- SQL Server: DATEADD/DATEDIFF for truncation; AT TIME ZONE 'UTC' → 'GMT Standard Time' for London; EOMONTH() helps with boundaries; use DATETIME2 (see the sketch below).
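For the SQL Server truncation idiom mentioned above, a small sketch (assuming order_ts is a DATETIME2 column stored in UTC):
-- SQL Server: start of month via DATEADD/DATEDIFF, end of month via EOMONTH
SELECT
DATEADD(month, DATEDIFF(month, 0, order_ts), 0) AS month_start,
EOMONTH(order_ts) AS month_end
FROM analytics.orders;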
What’s Next?
Day 09: Databases, Schemas & Namespaces
We’ll design clean environment layouts (dev/test/prod), set up schemas, and apply least-privilege grants so your medallion layers are organised and secure.
See you in the next one and… Happy Coding!


