Hi, I have ~3GB of CSV data in Azure Storage and am using Serverless SQL Pools to query this data using a View. If I do a query such as this I get a data processed total of around 3GB (which is to be expected as it's scanning all the data)
SELECT COUNT()
FROM
OPENROWSET
(
BULK 'sourcedatadesktopdataflow/salesorderline/OrderYear=/OrderMonth=/OrderDatePartition=/*.csv',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR =','
)
WITH
(
OrderLineID INT,
OrderDate DATE
)
AS fct
However, if I group by a column in the CSV data, the data processed is doubled to 6GB and I can't work out why this would be the case. Does anyone have any suggestions?
SELECT OrderDate,COUNT()
FROM
OPENROWSET
(
BULK 'sourcedatadesktopdataflow/salesorderline/OrderYear=/OrderMonth=/OrderDatePartition=/*.csv',
DATA_SOURCE = 'ExternalDataSourceDataLake',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE,
FIELDTERMINATOR =','
)
WITH
(
OrderLineID INT,
OrderDate DATE
)
AS fct
GROUP BY OrderDate