Day 08 of SQL for Data Engineers: Dates, Time Zones & Watermarks
Part of Week 2: Physical Design & Storage
Today we’ll kick off week 2 of the series by making time your friend. You’ll learn date math, safe time-window filters, how to store in UTC and display in local time, and how to run incremental loads with a high-watermark so your pipelines are fast and idempotent.
TL;DR
- Use half-open windows: ts >= start AND ts < next_start.
- Store timestamps in UTC; convert only for display.
- Derive date/month columns for partitioning (order_date, order_month).
- Watermarks let you load “only new/changed rows” safely and repeatably.
The Patterns
Derive date & month once (for grouping/partitioning)
Don’t keep calling functions in every query. Persist helpful columns for grouping and partition pruning.
-- Create helpful derived columns (as a view or materialised column)
SELECT
order_id,
order_ts, -- UTC timestamp
CAST(order_ts AS DATE) AS order_date, -- day grain
DATE_TRUNC('month', order_ts) AS order_month -- month grain
FROM analytics.orders;

Half-open time windows (no off-by-one bugs)
The beginner-proof pattern for time ranges is inclusive start, exclusive end.
-- All of July 2025
WHERE order_ts >= TIMESTAMP '2025-07-01 00:00:00'
AND order_ts < TIMESTAMP '2025-08-01 00:00:00'

This works for hours, days, months… just pick the next boundary.
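For example, a single day uses exactly the same shape (the date here is just illustrative):
-- All of 15 July 2025: same pattern, the next boundary is the following midnight
WHERE order_ts >= TIMESTAMP '2025-07-15 00:00:00'
AND order_ts < TIMESTAMP '2025-07-16 00:00:00'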
UTC in storage, local for display
Pipelines should not depend on local time or DST changes. Keep storage in UTC; convert only for local display.
-- Show London time for a UTC column (Postgres / DuckDB style)
SELECT
order_id,
order_ts, -- stored UTC
order_ts AT TIME ZONE 'Europe/London' AS order_ts_london -- display in local time
FROM analytics.orders;
-- SQL Server
SELECT
order_id,
order_ts, -- DATETIME2 assumed UTC
order_ts AT TIME ZONE 'UTC' AT TIME ZONE 'GMT Standard Time' AS order_ts_london
FROM analytics.orders;
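Spark SQL has no AT TIME ZONE; as the engine tips below note, from_utc_timestamp is the usual equivalent (a sketch assuming order_ts is stored in UTC):
-- Spark SQL: interpret the stored UTC value and render it in London local time
SELECT
order_id,
order_ts, -- stored UTC
from_utc_timestamp(order_ts, 'Europe/London') AS order_ts_london
FROM analytics.orders;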
Date math & truncation (portable patterns)
Common questions: “start of month,” “days between,” “last day of month,” “rolling N days.”
-- Start of the month containing order_ts
SELECT DATE_TRUNC('month', order_ts) AS month_start FROM analytics.orders;
-- Days since customer created (Postgres/DuckDB)
SELECT customer_id, (CURRENT_DATE - CAST(created_at AS DATE)) AS days_since_signup
FROM analytics.customers;
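The “last day of month” case from the list above isn’t shown; one common way in Postgres/DuckDB is interval arithmetic (SQL Server has EOMONTH(), sketched later):
-- Last day of the month containing order_ts (Postgres/DuckDB)
SELECT CAST(DATE_TRUNC('month', order_ts) + INTERVAL '1 month' - INTERVAL '1 day' AS DATE) AS month_end
FROM analytics.orders;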
-- Rolling 7 days window (half-open)
WHERE order_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days'
AND order_ts < CURRENT_TIMESTAMP

High-watermark loads (incremental & idempotent)
A watermark records the last successfully processed timestamp. Each run loads >= last_watermark AND < new_watermark. If the job reruns, results stay the same with no duplicates.
-- 1) Watermark table (one row per feed)
CREATE SCHEMA IF NOT EXISTS ops;
CREATE TABLE IF NOT EXISTS ops.high_watermark (
name VARCHAR PRIMARY KEY,
last_ts TIMESTAMP
);
-- Seed (run once, per feed)
INSERT INTO ops.high_watermark(name, last_ts)
VALUES ('orders', TIMESTAMP '1900-01-01')
ON CONFLICT (name) DO NOTHING; -- Postgres/DuckDB; use MERGE/IF NOT EXISTS in others
-- 2) Pick a new watermark boundary (e.g., max seen or 'now' rounded)
WITH bounds AS (
SELECT
(SELECT last_ts FROM ops.high_watermark WHERE name='orders') AS last_wm,
DATE_TRUNC('minute', CURRENT_TIMESTAMP) AS new_wm
),
src AS (
SELECT *
FROM analytics.orders, bounds
WHERE order_ts >= bounds.last_wm
AND order_ts < bounds.new_wm
)
-- 3) Load src into your target (staging/merge) here…
-- 4) Advance the watermark AFTER a successful load
--    (bounds is only visible inside this statement; if you run step 4 separately,
--     carry the same new_wm value over yourself)
UPDATE ops.high_watermark
SET last_ts = (SELECT new_wm FROM bounds)
WHERE name = 'orders';

Data Engineering Wisdom
- Persist order_date/order_month; partition facts on one of them.
- Never hard-code “end of day” as 23:59:59.999; always use half-open windows.
- Keep UTC in storage; add display views for business time zones.
- Record min_ts/max_ts/row_count per batch in an audit table with the watermark you used.
- For high-volume feeds, overlap the last hour/day and MERGE by key to tolerate late arrivals (see the sketch after this list).
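A minimal sketch of that last point, assuming a hypothetical staging table stg.orders_batch and Postgres 15+ style MERGE syntax (adapt the syntax and column list to your engine):
-- Re-read with a 1-day overlap, then upsert by business key so late or duplicate rows are absorbed
MERGE INTO analytics.orders AS t
USING (
SELECT *
FROM stg.orders_batch -- hypothetical staging table holding the latest extract
WHERE order_ts >= (SELECT last_ts FROM ops.high_watermark WHERE name = 'orders') - INTERVAL '1 day'
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET order_ts = s.order_ts -- …plus the other columns you track
WHEN NOT MATCHED THEN
INSERT (order_id, order_ts) VALUES (s.order_id, s.order_ts);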
Mini-exercises (5–10 minutes)
1. Daily order counts (July 2025): Return order_date, orders_count for all July days (include days with 0 by generating the date series if your engine supports it; see the hint after these exercises). Expected (seed data): counts on 2025-07-01, 2025-07-15, 2025-07-31 only.
2. Localised display: Select order_id, order_ts and a London local column. Check: 2025-07-01 09:00 UTC should display as 10:00 London (BST) in July.
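Hint for exercise 1: engines like Postgres and DuckDB can generate the missing days with generate_series, which you then LEFT JOIN your counts onto (a sketch; the table alias is illustrative):
-- Every day in July 2025 (Postgres/DuckDB generate_series)
SELECT CAST(d AS DATE) AS order_date
FROM generate_series(TIMESTAMP '2025-07-01', TIMESTAMP '2025-07-31', INTERVAL '1 day') AS t(d);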
Tips for Different Engines
- Postgres/DuckDB/Spark SQL: DATE_TRUNC('month', ts), CAST(ts AS DATE), intervals like INTERVAL '7 days', and AT TIME ZONE 'Europe/London'. Spark uses from_utc_timestamp(ts, 'Europe/London') / to_utc_timestamp(ts, 'Europe/London').
- SQL Server: DATEADD/DATEDIFF for truncation; AT TIME ZONE 'UTC' → 'GMT Standard Time' for London; EOMONTH() helps with boundaries; use DATETIME2 (see the sketch below).
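For the SQL Server truncation idiom mentioned above, a small sketch (assuming order_ts is a DATETIME2 column stored in UTC):
-- SQL Server: start of month via DATEADD/DATEDIFF, end of month via EOMONTH
SELECT
DATEADD(month, DATEDIFF(month, 0, order_ts), 0) AS month_start,
EOMONTH(order_ts) AS month_end
FROM analytics.orders;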
What’s Next?
Day 09: Databases, Schemas & Namespaces
We’ll design clean environment layouts (dev/test/prod), set up schemas, and apply least-privilege grants so your medallion layers are organised and secure.
See you in the next one and… Happy Coding!


