Byte Insight: Why You Should Adopt a Data Partitioning Strategy in Your Data Engineering Pipelines
Unlock your data platform's ability to query at speed and reduce stakeholder frustration.
If you’ve worked in big data, or even just with extremely large datasets, you’ve unfortunately had to deal with query performance issues. Back in the simple DBA days there were all kinds of ways to troubleshoot and reduce query time, like:
Indexing techniques like clustered/non-clustered, composite etc.
Heavy normalisation.
Precise type and size requirements per column within each table.
Indexed (materialised) views stored on disk.
Execution plan analysis.
No SELECT * in any stored procedures (still frowned upon, mind you!)
Most of the above can send shivers down the spine of even the most experienced DBA. But the standout method was, and still is, partitioning data. We’ll cover exactly how to do that within your Delta Lake, along with best practices for managing and monitoring your partitions.
The method we will explore is a popular approach that I use myself:
“year=/month=/day=”
Why Partitioning Matters in Delta Lake
Partitioning involves dividing a dataset into smaller, organised chunks based on a specific column or a set of columns. For Delta Lake, partitioning is critical for the following reasons:
Improved Query Performance: Allows for pruning of unnecessary data during queries, reducing I/O and speeding up results.
Cost Efficiency: Efficient queries mean lower compute costs, which is vital for managing large-scale data.
Operational Flexibility: Supports incremental data ingestion, enabling faster backfills and updates.
Take a standard query run against a directory or table without partitioning: it will scan the entire dataset regardless of the filters applied, for example WHERE order_date = “2024-12-05”. However, if you partition your directory or table on order_date, the engine prunes the unnecessary partitions and only reads the ones matching that date.
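To see that pruning in action, here's a minimal PySpark sketch. The path /mnt/datalake/sales_data and the order_date partition column are purely illustrative:
from pyspark.sql import functions as F
# Read a Delta table partitioned on order_date (hypothetical path)
sales_df = spark.read.format("delta").load("/mnt/datalake/sales_data")
# Filtering on the partition column lets the engine skip every non-matching partition
one_day_df = sales_df.filter(F.col("order_date") == "2024-12-05")
# The physical plan's PartitionFilters entry confirms which partitions survived the prune
one_day_df.explain(True)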
Reasons to Partition on YYYY/MM/DD
Using the partitioning method “year=/month=/day=” is, in my opinion, a lot more effective than partitioning on “date=”. It allows for fine-grained partitioning through a hierarchical structure, making it easier to filter and query. A single date partition creates a flat structure of granular partitions, leaving you with a large number of partitions whose sizes swing between very small and very large depending on the volume of data on each day. Looking at the maths below, you’d think it hardly makes a difference:
Example: 5 years of data (2020-2024)
Hierarchical Structure:
Total Year Partitions - 5
Total Month Partitions - 5x12 = 60
Total Day Partitions - 5x12x30 = 1,800 (approximating 30 days per month)
Flat Structure:
Total Day Partitions - 5x365 = 1,825
However, when we delve into it, we can see that with the hierarchical structure the query engine can prune at the year, month, and day levels, whereas the flat structure cannot prune by year or month and so must evaluate every date partition.
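To make that concrete, here's a rough sketch of building the hierarchical layout. The DataFrame name sales_data_df and the order_date column are assumptions for illustration:
from pyspark.sql import functions as F
# Derive the hierarchical partition columns from the order date
partitioned_df = sales_data_df \
    .withColumn("year", F.year("order_date")) \
    .withColumn("month", F.month("order_date")) \
    .withColumn("day", F.dayofmonth("order_date"))
# Write the Delta table using the year=/month=/day= layout
partitioned_df.write.format("delta") \
    .partitionBy("year", "month", "day") \
    .mode("overwrite") \
    .save("/mnt/datalake/sales_data")
# A filter on year and month only has to touch ~30 day partitions, not all 1,800+
spark.read.format("delta").load("/mnt/datalake/sales_data") \
    .where("year = 2024 AND month = 12") \
    .explain(True)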
Why Does this Matter?
Distributed systems like Hive, Spark, and Databricks store partition metadata in centralised stores. Query planning time increases with the number of partitions, so the more the engine can prune, the better for your stakeholders!
You Should Partition ALL Your Data Lake Layers
Partitioning every layer, even the Bronze layer, is essential for maintaining flexibility in your pipelines. Using the dreaded backfill as an example, we’ll cover why it’s necessary at all layers.
Backfilling involves reprocessing historical data to correct errors or fill in data that either arrived late to the party or was missed due to processing issues. Without partitioning, backfilling becomes compute-heavy, making it expensive and prone to errors.
Example Scenario: A retail company needs to update sales data for a week due to a reporting error. With YYYY/MM/DD partitioning, only the affected partitions are reprocessed.
Result: The backfill is quick and cost effective, with minimal disruption to ongoing workflows.
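As a rough sketch of what that targeted backfill could look like using Delta Lake's replaceWhere option (the corrected_week_df DataFrame and the exact date range are purely illustrative):
# Overwrite only the affected week's partitions, leaving the rest of the table untouched
corrected_week_df.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "year = 2024 AND month = 12 AND day BETWEEN 1 AND 7") \
    .save("/mnt/datalake/sales_data")
Because year, month, and day are partition columns, only the matching directories are rewritten.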
There may be instances where you’ll need to backfill multiple layers, so laying a partitioning foundation across your data sets you up for success and fewer headaches down the line.
Sneak Peek: Monitoring and Maintaining Partitions in an Azure Databricks Solution
I’ve not forgotten about my upcoming blog series “The Blueprint for Building Scalable Data Platforms on Azure”, so I’ve decided to include a small snippet of what the future holds in Parts 3 and 4.
Partitioning is a powerful tool, but it requires active monitoring and maintenance to ensure it continues delivering performance benefits. Here are some best practices with code examples for an Azure Databricks solution:
1. Optimise Partition Sizes
Partitions that are too small or too large can hurt performance:
Small Partitions lead to excessive metadata and slow query planning.
Large Partitions lead to inefficient scans.
Solution: Use Delta Lake’s OPTIMIZE command to compact small files within partitions into more manageable sizes.
spark.sql("""
OPTIMIZE delta.`/mnt/datalake/sales_data` ZORDER BY (order_date)
""")
Best Practice: Aim for partition files between 100 MB and 1 GB.
More Info here: OPTIMIZE - Azure Databricks - Databricks SQL | Microsoft Learn
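If you want a quick way to check where a table stands against that target, DESCRIBE DETAIL exposes the file count and total size. This is just a rough sketch, and the ~100 MB threshold is simply the rule of thumb from above:
# Pull table-level file statistics from the Delta log
detail = spark.sql("DESCRIBE DETAIL delta.`/mnt/datalake/sales_data`").collect()[0]
# Average file size in MB (guard against an empty table)
avg_file_mb = detail["sizeInBytes"] / max(detail["numFiles"], 1) / (1024 * 1024)
print(f"{detail['numFiles']} files, average file size {avg_file_mb:.1f} MB")
# An average well below ~100 MB suggests the table is due an OPTIMIZE run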
2. Leverage Delta Lake Partitioning
Delta Lake combines traditional partitioning with advanced optimisations like Z-Ordering:
Partitioning: YYYY/MM/DD.
Z-Ordering: Co-locate frequently queried columns for faster scans.
Example: Partition by year, month, and day, then Z-Order by order_date:
# Create a Delta table partitioned by year, month, and day
sales_data_df.write.format("delta") \
.partitionBy("year", "month", "day") \
.mode("overwrite") \
.save("/mnt/datalake/sales_data")
# Optimise using Z-Order
spark.sql("""
OPTIMIZE delta.`/mnt/datalake/sales_data` ZORDER BY (order_date)
""")
3. Monitor Query Execution Plans
Use the Databricks Query Profile or Spark UI to check partition pruning.
Example: Ensure only relevant partitions are scanned:
# Enable adaptive query execution (optional here, but it helps Spark optimise plans at runtime)
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Analyse a query execution plan
sales_df = spark.sql("""
SELECT * FROM delta.`/mnt/datalake/sales_data` WHERE year = 2024 AND month = 12
""")
sales_df.explain(True)
Look for "PartitionFilters" in the output to confirm pruning is happening.
4. Periodically Reorganize Partitions
Over time, partitions can become imbalanced. Use OPTIMIZE and VACUUM to maintain performance.
# Vacuum to remove files no longer referenced by the table
# (168 hours = 7 days, the default minimum retention period)
spark.sql("""
VACUUM delta.`/mnt/datalake/sales_data` RETAIN 168 HOURS
""")
Rebalancing: If partitions are skewed, consider repartitioning or reorganising the data.
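As a sketch of one way to approach that, you could rewrite a skewed table with a coarser year/month scheme into a separate location (the output path and the choice of scheme are illustrative; writing to a new path avoids reading and overwriting the same table in one job):
# Read the skewed table and rewrite it with a coarser partition scheme
existing_df = spark.read.format("delta").load("/mnt/datalake/sales_data")
existing_df.write.format("delta") \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .save("/mnt/datalake/sales_data_rebalanced")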
Partitioning is not a one-time task; it requires ongoing optimisation and monitoring. These best practices, combined with the right tools in Azure Databricks, will ensure a scalable, high-performing data platform.
What’s your take on these strategies? Let me know in the comments!
Hopefully you enjoyed a little peek at the content that's coming. Remember, the first part drops on the 19th of this month. Also, check out the Release Radar so you know when each part lands and what it covers: Release Radar
Subscribe if you haven’t already so you never have to worry about missing a drop, and thanks again for reading!