Why Z-Ordering Fails on Skewed Data — and Liquid Clustering Does Not
Z-ordering and Liquid Clustering both aim to improve Databricks query performance through data skipping. But when your data is skewed, one of them quietly becomes useless. A visual explanation of why — and how the Hilbert curve changes everything.
After twenty years of building enterprise data warehouses, I have learned that data is never evenly distributed. Customer tables are dominated by one country. Transaction tables are dominated by one product line. Event tables are dominated by one event type.
This is skew. And skew is what separates a data layout strategy that works in theory from one that works in production.
Databricks offers two approaches to organising data within Delta Lake tables: Z-ordering and Liquid Clustering. Both aim to improve query performance through data skipping — the ability to avoid reading files that cannot contain relevant rows. Both sound promising in documentation. But when your data is skewed, one of them quietly becomes useless.
Want more practical data engineering analysis like this?
Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.
This article explains why, with illustrations that make the mechanism visible.
The setup: a skewed dataset
Consider a transaction table with five countries: US, DE, ES, IT, and FR. Eighty percent of the records come from the United States. The remaining twenty percent is split evenly among the other four countries.
This is not an unusual distribution. In most enterprise datasets I have worked with, a single value dominates at least one important column.
Raw data: 80% US, 5% each DE, ES, IT, FR Each strip represents records for that country across the date range
US (80%)
DE (5%)
ES (5%)
IT (5%)
FR (5%) The US row is packed. The other rows are sparse. This is skew.
What Z-ordering does with skewed data
Z-ordering uses a Z-order space-filling curve to sort data across multiple columns simultaneously. When you run OPTIMIZE my_table ZORDER BY (country, date), the engine reads all data, interleaves the bits of the country and date values using the Z-curve, sorts the records, and writes them into new files.
The critical property of the Z-curve is that it treats all values uniformly. It does not know or care that US appears eighty times more often than FR. It zigzags through the two-dimensional space in a fixed pattern.
After Z-ordering: US records contaminate every file The Z-curve assigns records to files uniformly. Every file contains a mix. File 1
File 2
File 3
Get the next issue by email.
File 4
File 5
Every file’s min/max for country spans DE to US. No file can be skipped for any country query.
The result is visible in the illustration above. Because US records are everywhere, every file contains a mix of all countries. When the query engine checks whether a file could contain German records, it looks at the min/max statistics and concludes: yes, this file could contain DE rows. For every single file. No files can be skipped.
What Liquid Clustering does differently
Liquid Clustering uses a Hilbert space-filling curve instead of a Z-order curve. The Hilbert curve has a property that matters enormously for skewed data: better locality preservation. Points that are close together in multi-dimensional space stay closer together in the one-dimensional ordering.
In practical terms, the Hilbert curve spends more time in dense regions before moving to sparse ones. When eighty percent of the data sits in the US row, the curve fills that row first, producing files that contain only US records. Then it visits the sparse country rows and groups them tightly together.
After Liquid Clustering: each file has a tight country range The Hilbert curve adapts to the distribution. US gets dedicated files. Minorities grouped.
Three files contain only US. Two files group the minority countries tightly. Every file has a narrow min/max range.
Three files contain only US records. One file contains DE and ES. One file contains IT and FR. Each file has a tight min/max range for the country column. When the query engine checks whether a file could contain German records, it skips four out of five files instantly.
Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.
The skew advantage: both majority and minority queries benefit
Here is the insight that most explanations miss. Liquid Clustering does not merely help queries on minority values. It helps queries on the majority value as well.
Three queries, same data, side by side Z-ordering always scans everything. Liquid Clustering skips files on every query — even for the dominant value.
Query 1: WHERE country = ‘DE’ (minority)
Query 2: WHERE country = ‘US’ (majority — the skewed value)
Query 3: WHERE country = ‘US’ AND date = ‘March’ (filtered majority)
The maintenance difference
There is a second dimension to this comparison that matters in production: what happens when new data arrives.
500 new US records arrive Land as small unsorted files ↓ Z-order immediately degrades ↓ Must re-run OPTIMIZE on entire table ↓ Expensive. Manual. Same mixed result.
500 new US records arrive Engine detects: all US records ↓ Merges with existing US files only ↓ Rewrites 1-2 files. Others untouched. ↓ Cheap. Automatic. Clustering stays tight.
Z-ordering is a static, full-table operation. The moment new data lands — which, in a skewed dataset, will overwhelmingly be more US records — the Z-ordering starts degrading. To restore it, you must re-run OPTIMIZE on the entire table. On a large table, this is expensive. On a skewed table, it is expensive and produces the same ineffective result as before.
Liquid Clustering works incrementally. When new US records arrive, the engine merges them into the appropriate cluster, rewriting only the affected files. The minority files are never touched. The clustering remains tight without a full-table rewrite, without manual scheduling, and without the associated compute cost.
Summary
| Z-ordering | Liquid Clustering | |
|---|---|---|
| Curve type | Z-order (uniform) | Hilbert (locality-preserving) |
| Skew handling | Treats all values equally — dominant value contaminates every file | Adapts to distribution — dominant value gets dedicated files |
| Minority query | All files scanned | Most files skipped |
| Majority query | All files scanned | Minority files skipped |
| Maintenance | Manual full-table OPTIMIZE | Automatic incremental |
| New data | Degrades immediately | Merges incrementally |
| Syntax | OPTIMIZE t ZORDER BY (c) | ALTER TABLE t CLUSTER BY (c) |
The bottom line: Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.
📚 Want more like this?
Join 4,000+ data warehouse professionals who get practical insights on Teradata, Snowflake, and Databricks delivered to their inbox. Subscribe Free
No spam. Unsubscribe at any time.
Planning or surviving an enterprise data platform migration?
I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.
Subscribe before the paid plan launches and keep launch access.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.