Data Processed when using aggregate query

Question

Hi, I have ~3GB of CSV data in Azure Storage and am using Serverless SQL Pools to query this data using a View. If I do a query such as this I get a data processed total of around 3GB (which is to be expected as it's scanning all the data)

SELECT COUNT()
FROM
OPENROWSET
(
BULK 'sourcedatadesktopdataflow/salesorderline/OrderYear=/OrderMonth=/OrderDatePartition=/*.csv',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR =','
)
WITH
(
OrderLineID INT,
OrderDate DATE
)
AS fct

However, if I group by a column in the CSV data, the data processed is doubled to 6GB and I can't work out why this would be the case. Does anyone have any suggestions?

SELECT OrderDate,COUNT()
FROM
OPENROWSET
(
BULK 'sourcedatadesktopdataflow/salesorderline/OrderYear=/OrderMonth=/OrderDatePartition=/*.csv',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR =','
)
WITH
(
OrderLineID INT,
OrderDate DATE
)
AS fct
GROUP BY OrderDate

Answer

Hello @Serverless SQL ,
Thanks for the ask and using the Microsoft Q&A platform .

Please check the query plan in both cases and see what the difference . It may be sorting/spooling/hashing + spilling .

Please do let me know how it goes .
Thanks
Himanshu
Please do consider clicking on "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members

Answer

Hi, this is serverless sql pools, is there an option to show the query plan?

Data Processed when using aggregate query

2 answers