question

arkiboys avatar image
0 Votes"
arkiboys asked arkiboys commented

read new column from .parquet

Hi,
The sink parquet dataset files have two columns:
column1, column2

I have created a serverless sqlpool view which reads the .parquet file from the blobstorage container folder...
If I want to pull another column, i.e. column3 into the existing .parquet files, then the view will not be able to read the new column unless I separate the old and new .parquet files.
What is the solution to keep the old and new .parquet files in the same blobstorage directory and yet be able to read the new column too?
Thank you

azure-data-factory
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

1 Answer

ThomasBoersma avatar image
0 Votes"
ThomasBoersma answered arkiboys commented

Hi @arkiboys ,

You can achieve this by using the WITH statement in combination with the OPENROWSET. I also suggest that you check the docs here.

Here is an example:

I have two parquet files: clock1.parquet and clock2.parquet. The file clock1.parquet doesn't have the column Seconds, but the clock2.parquet file does have that column. In the screenshot you can see that the first and second files are merged together and that the results from the column Seconds in the first file is set to null.

 SELECT 
     Time,
     Hour,
     Minute,
     Seconds,
     [result].filepath() AS filepath,
     [result].filename() AS filename
 FROM OPENROWSET(
         BULK 'https://{storagename}.dfs.core.windows.net/data/Test/*.parquet',
         FORMAT='PARQUET'
     ) WITH (
  [Time] varchar(8),
  [Hour] int,
  [Minute] int,
  [Seconds] int
  ) AS [result]

131889-image.png


Hope this helps.



image.png (47.3 KiB)
· 6
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hi,
I use the following
WITH(
[Object] NVARCHAR(50),
[PipelineName] NVARCHAR(50),
[RunDate] datetime,
[RowsRead] int
)

but the error I get is:
Column 'RowsRead' of type 'INT' is not compatible with external data type 'Parquet physical type: BYTE_ARRAY, logical type: UTF8', please try with 'VARCHAR(8000)'.

0 Votes 0 ·

Hi,

It seems that you are trying to convert a BYTE_ARRAY into a integer, that is not possible. Is the content of the column RowsRead the expected content? Because RowsRead needs to be a number, if I understand the column name correctly, and not a BYTE_ARRAT. Try to use VARCHAR(MAX) for the column RowsRead if the content is correct, i.e.: [RowsRead] VARCHAR(MAX), otherwise try to change the content.

If that didn't help, can you tell us a bit more about the contents of the parquet file?

0 Votes 0 ·

Yes, changing to varchar solves the issue but because the RowsRead is a number I wonder if I should have it as int instead and if so , then why it errors.?

0 Votes 0 ·
Show more comments