Power Query super slow after sorting and filtering?

Question

I am really having trouble with the slow speed of Power Query after I make some necessary transformations. First, I am using

Microsoft 365 MSO (16.0.14326.20384) 64-bit
Edition Windows 10 Pro
Version 21H1
Installed on ‎7/‎27/‎2021
OS build 19043.1237

Computer has 64GB of RAM.

I am working with a large .csv dataset that has 36 million rows. What I have noticed is the following - if I leave it totally unfiltered or unsorted, I can make transformations quickly, often very quickly. For example, if I add an index column, it takes about 1 second. But as soon as I filter the data or sort it, the same index column takes many minutes, so many that I have never let it finish. A more complex custom column using M Code took over and hour before I stopped it These transformations are necessary, particularly the sorting, so I can't leave them out. Btw, I have similar problems with smaller datasets, though not as dramatic, as every step after sorting and filtering is much slower.

Is there a reason why the filtering and sorting slows everything down so much? And can anything be done to improve the speed?

Thank you!

Answer

When you're previewing the raw file, PQ only needs to read the first few hundred rows in order to render the preview. However, as soon as you do an operation like Sort, PQ now needs to read all 36 million rows in order to display the first few hundred rows, since it's possible that the first few hundred sorted rows are waaaay down at the end of the list.

The same is true for certain other operations. They require reading the full data source, and thus will be slow.

However, some operations (such as Filter) are different. They can operate over the data in what's called a streaming manner, reading only as many rows as are needed to populate the first few hundred rows of the preview. If I filter for rows where the customer is "Contoso", PQ only needs to iterate over the rows until it's found a few hundred rows that match the filter; the rest of the file can be ignored.

Suggestions:

Always do streaming operations up front, when possible, leaving non-streaming operations until later.
When creating your queries, consider first doing a "Keep First Rows" operation and limiting the number of rows you're working against. Then, once you've added all the steps you need, remove the "Keep First Rows" step.

Power Query super slow after sorting and filtering?

1 answer