question

RyanAbbey-0701 asked ShaikMaheer-MSFT commented

One CSV to multiple Parquet files

We have one large CSV file that we want to convert to Parquet. Following the recommended standard of Parquet files of up to 1 GB, we are splitting the output across several files, but we are running into a couple of issues:

  1. If we don't specify a file name in the Parquet dataset definition and set a limit of, e.g., 10,000,000 rows per file, the copy activity auto-generates a subfolder named after the input file, which we don't want.

  2. If we extend 1 to specify a "File name prefix", we get the error FileNamePrefixNotSupportFileBasedSource (the info box does say you can't specify a prefix with file-based sources).


So how do we stop it generating a subfolder based on the source file name? It seems pretty restrictive and illogical to force an unwanted subfolder (a MS trait that hasn't stopped through the years!)
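
For readers following along, the sink configuration described in points 1 and 2 corresponds roughly to the fragment below, written as a Python dict that serializes to the pipeline JSON. This is only a sketch: it assumes the Copy activity's `ParquetSink` with `ParquetWriteSettings` and its `maxRowsPerFile` property, and the store settings type (`AzureBlobFSWriteSettings`, i.e. ADLS Gen2) is a guess, since the thread does not say which storage is used.

```python
import json

# Sketch of a Copy activity sink that splits Parquet output by row count.
# Assumption: the sink store is ADLS Gen2 (AzureBlobFSWriteSettings); the
# thread does not say which store is actually used.
parquet_sink = {
    "type": "ParquetSink",
    "storeSettings": {"type": "AzureBlobFSWriteSettings"},
    "formatSettings": {
        "type": "ParquetWriteSettings",
        "maxRowsPerFile": 10_000_000,  # value quoted in the question
        # "fileNamePrefix" is deliberately left out: per the question, setting
        # it with a file-based source fails with
        # FileNamePrefixNotSupportFileBasedSource.
    },
}

print(json.dumps(parquet_sink, indent=2))
```

With only `maxRowsPerFile` set, the behaviour reported in point 1 (a subfolder named after the source file) is what the question is trying to avoid.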



azure-data-factory

Hi @RyanAbbey-0701 ,

Thank you for posting your query on the Microsoft Q&A platform.

Could you please help with the details below, so that we can reproduce the scenario and give you a detailed implementation?

If we don't specify a file within the parquet definition and specify e.g. 10,000,000 rows per file, what we find is the copy activity is autogenerating a subfolder based on the input file name which we don't want.

Could you please share a screenshot of your Parquet dataset, along with details of any parameters declared in the dataset?

If we extend 1 to specify a "File name prefix", we get error FileNamePrefixNotSupportFileBasedSource (I note the info box does say you can't specify a prefix with file based sources)

Could you please share a screenshot of this dataset configuration as well?

These details would be a great help in solving your issue. Thank you.


Hi @RyanAbbey-0701 ,

Could you please share details on the clarifications requested above? This will help us understand the issue better and provide a detailed resolution.


Apologies, forgot all about the questions as we moved on...


Hopefully the images actually show...

  1. For the file xx_fct.zip, we want to create a folder (as highlighted) that is renamed to suit our needs and add parquet files to this folder. Instead, it's created a subfolder xx_fct.zip

110015-image.png
109988-image.png

  2. These settings are as per Sink settings - we cannot enter a "file prefix" because we are coming from a file source
    109940-image.png



Hi @RyanAbbey-0701 ,

Thank you for reframing your ask. A couple of small clarifications:

  • When you move the "xx_fct.zip" file into the "iri_FCT_20210615" folder, do you want to unzip it and extract all the files in it?

  • Should the date (yyyyMMdd) in your folder name "iri_FCT_20210615" be the date on which the copy runs?

Kindly share details on the clarifications above; that will help us provide a detailed resolution. Thank you.




  1. Yes, there's actually only one file within the zip but it will be unzipped (and re-compressed for parquet) although if there is a way to not unzip and restructure for parquet I'm willing to listen

  2. Close enough, it's dynamically determined via a number of factors


Hi @RyanAbbey-0701 ,

Just checking whether the answer provided below helps. If so, please Accept Answer; accepting an answer helps the community. Thank you.


Hi @RyanAbbey-0701 ,

Following up to check whether the answer provided below helps. If so, please Accept Answer; accepting an answer helps the community. Thank you.


1 Answer

ShaikMaheer-MSFT answered ShaikMaheer-MSFT commented

Hi @RyanAbbey-0701 ,

Please check the detailed example below, which copies the file to a folder whose name is created dynamically, as you requested above (iri_FCT_yyyyMMdd).
Step 1: Create a variable in your pipeline to hold the current date. Use a Set Variable activity to set its value.
111190-setvariable.gif

Step 2: Use a Copy activity to copy the zip file. The source and sink dataset types should be Binary. In the sink dataset, create a parameter that dynamically supplies the target folder name as "iri_FCT_yyyyMMdd".
111241-copyactivity.gif
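
The exact expression used in the animations is not visible in this text, so as an illustration only: in the pipeline, an expression along the lines of `@concat('iri_FCT_', formatDateTime(utcnow(), 'yyyyMMdd'))` would be one way to build the folder name. The Python sketch below mirrors that logic so the naming rule can be checked offline; the function name `target_folder` and its parameters are hypothetical.

```python
from datetime import datetime, timezone


def target_folder(now=None, prefix="iri_FCT_"):
    """Build a folder name such as iri_FCT_20210615 from a UTC timestamp."""
    # Default to the current UTC time, mirroring utcnow() in the pipeline.
    now = now or datetime.now(timezone.utc)
    # %Y%m%d is the strftime equivalent of the yyyyMMdd pattern.
    return f"{prefix}{now:%Y%m%d}"


print(target_folder(datetime(2021, 6, 15, tzinfo=timezone.utc)))
```

Passing a fixed timestamp, as in the last line, makes the result reproducible; calling `target_folder()` with no arguments uses today's UTC date.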

Hope this helps.



That would copy the zip file to a directory of the same name? We are trying to convert to Parquet and split into multiple files at the same time.


Hi @RyanAbbey-0701 ,

You can convert a single CSV file to multiple Parquet files using data flows.

In the data flow Sink transformation, you can use partitioning to split your data and save each partition as a separate file.

To learn more about partitioning in data flows, please see the link below:
https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab

In the example below, I am splitting the file into 2 partitions.

112493-image.png
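
To make the effect of partitioning concrete, here is a small Python sketch of round-robin distribution, where each resulting bucket stands in for one output file. The partition type chosen in the screenshot is not visible in this text, so round robin is an assumption, and `round_robin` is a hypothetical helper rather than anything provided by Data Factory.

```python
from itertools import cycle


def round_robin(rows, partitions):
    """Deal rows into `partitions` buckets in turn; one bucket per output file."""
    buckets = [[] for _ in range(partitions)]
    # cycle() walks the buckets repeatedly; zip() stops when rows run out.
    for bucket, row in zip(cycle(buckets), rows):
        bucket.append(row)
    return buckets


rows = [{"id": i} for i in range(5)]
for index, bucket in enumerate(round_robin(rows, 2)):
    print(index, [row["id"] for row in bucket])
```

With 2 partitions the rows are spread as evenly as possible, so the number of partitions, rather than a row limit, determines how many files are written.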

Hope this helps.


Please Accept Answer if this helps. Thank you.


Hi @RyanAbbey-0701 ,

Just checking whether the answer provided above helps. If so, please Accept Answer; accepting an answer helps the community. Thank you.
