Data Processed when using aggregate query

Serverless SQL 216 Reputation points MVP
2021-06-05T18:21:50.833+00:00

Hi, I have ~3GB of CSV data in Azure Storage and am using Serverless SQL Pools to query this data using a View. If I do a query such as this I get a data processed total of around 3GB (which is to be expected as it's scanning all the data)

SELECT COUNT()
FROM
OPENROWSET
(
BULK 'sourcedatadesktopdataflow/salesorderline/OrderYear=
/OrderMonth=/OrderDatePartition=/*.csv',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR =','
)
WITH
(
OrderLineID INT,
OrderDate DATE
)
AS fct

However, if I group by a column in the CSV data, the data processed is doubled to 6GB and I can't work out why this would be the case. Does anyone have any suggestions?

SELECT OrderDate,COUNT()
FROM
OPENROWSET
(
BULK 'sourcedatadesktopdataflow/salesorderline/OrderYear=
/OrderMonth=/OrderDatePartition=/*.csv',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR =','
)
WITH
(
OrderLineID INT,
OrderDate DATE
)
AS fct
GROUP BY OrderDate

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,253 questions
{count} votes

2 answers

Sort by: Most helpful
  1. HimanshuSinha-msft 19,376 Reputation points Microsoft Employee
    2021-06-08T20:28:28.65+00:00

    Hello @Serverless SQL ,
    Thanks for the ask and using the Microsoft Q&A platform .

    Please check the query plan in both cases and see what the difference . It may be sorting/spooling/hashing + spilling .

    Please do let me know how it goes .
    Thanks
    Himanshu
    Please do consider clicking on "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members

    0 comments No comments

  2. Serverless SQL 216 Reputation points MVP
    2021-06-24T15:17:42.313+00:00

    Hi, this is serverless sql pools, is there an option to show the query plan?

    0 comments No comments