This section describes practices to improve query performance in Delta Lake.
Provide data location hints
If you expect a column to be commonly used in query predicates and if that column has high cardinality
(that is, a large number of distinct values), then use
Z-ORDER BY. Delta Lake automatically lays out
the data in the files based on the column values and use the layout information to skip irrelevant data while querying.
For details, see Z-Ordering (multi-dimensional clustering).
Choose the right partition column
You can partition a Delta table by a column. The most commonly used partition column is
Follow these two rules of thumb for deciding on what column to partition by:
- If the cardinality of a column will be very high, do not use that column for partitioning. For example,
if you partition by a column
userIdand if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
- Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction.
You can compact a table using the OPTIMIZE command.