thracian asked ShaikMaheer-MSFT commented

Remove '00001' suffix in file name generated by data flow

I'm using an Azure Data Factory data flow to save incoming data as partitioned *.parquet files (Year/Month). I set the file names with the Pattern setting, as shown in the screenshot below. ADF automatically appends "-00001" to the file name, e.g. "Sales Date=2021-08-07-00001", which I don't need because I already generate the file name with an expression. The Optimize tab is set to the Key partition type.

Is there any way to remove the '00001' suffix from the file name?

121371-image.png




azure-data-factory

The Pattern option uses the Spark partition naming scheme together with your pattern to devise the target file name. If you're looking to take full control over the final file name, then use the option Output to Single File.


Thanks Mark,

"Output to Single File" complains about performance. In addition, the sink automatic partitioning doesn't work.

To give you a bit more info, we're extracting data from on-prem tables 1:1 to parquet files in ADLS 2.0. Some tables are large and will require incremental extraction: we get the latest data since the last watermark and store it as a separate file. So, incremental tables will have many files, such as:

Sales Date=1900-01-01 (full load)
Sales Date=2021-08-09 16:32:34 (first incremental load)
Sales Date=2021-08-10 01:30:14 (second incremental load)

We want to take advantage of the automatic partitioning by key. "Output to single file" won't work for that, will it? I guess we just have to keep the '00001' suffix.





In your case, you can just use the Key Partitioning option to set folders based on those values:

https://youtu.be/7Q-db4Qgc4M?t=401

https://youtu.be/Samj4b_ZSrY?t=394

thracian answered ShaikMaheer-MSFT commented

How are these Spark partition files generated? I thought they would all have the '00001' suffix, but after staging a large dataset, I see that they now have different numbers. What will happen if I rerun the load? Will Spark retain the same numbers? Is there a way to control the size of a partition?

123655-image.png




You can set the number of partitions with the other partitioning types, but not with key partitioning.
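To illustrate why the count is fixed for key partitioning, here is a rough conceptual sketch in plain Python. It is not ADF or Spark code, and the function names and the `sales_date` column are made up for the example. It assumes key partitioning creates one partition per distinct key value, so the data decides the count, whereas a type like round robin takes the count as an input.

```python
from collections import defaultdict


def key_partition(rows, key):
    """Conceptual model: one partition per distinct key value.

    The number of partitions follows the data, so there is nothing to set.
    """
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def round_robin_partition(rows, num_partitions):
    """Conceptual model: the caller picks the count and rows are dealt out in turn."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


# Hypothetical sample rows, only for illustration.
rows = [
    {"sales_date": "2021-08-07", "amount": 10},
    {"sales_date": "2021-08-07", "amount": 20},
    {"sales_date": "2021-08-09", "amount": 30},
]

by_key = key_partition(rows, "sales_date")
by_count = round_robin_partition(rows, 2)

print(sorted(by_key))               # the distinct sales_date values
print([len(p) for p in by_count])   # rows per requested partition
```

If a new `sales_date` value shows up in the next load, `key_partition` yields one more partition without any setting changing, which is the behavior described above.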


Hi @TeoLachev-9086,

Following up to check whether the above answer helped you. If yes, please accept the answer. Accepting an answer helps the community as well. Thank you.

thracian answered ShaikMaheer-MSFT commented

Which is what we use. It looks like there isn't a way to ignore the Spark partitioning scheme. The suggested Filename[n] pattern implies that I could remove the "[n]" and thereby stop -00001 from being added to each file.
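If the suffix can't be suppressed in the sink, one possible workaround is to rename the files after the data flow finishes. Below is a minimal Python sketch of the name mapping only. It assumes the suffix is always a hyphen followed by exactly five digits at the end of the name (optionally before an extension), as in the "Sales Date=2021-08-07-00001" example. The helper names are hypothetical, and the actual rename against ADLS is not shown. Because a large key can produce several numbered files, the sketch skips any names that would collide after stripping.

```python
import re
from collections import Counter

# Assumption: the suffix is "-" plus five digits, at the very end of the name
# or directly before a file extension such as ".parquet".
SUFFIX = re.compile(r"-\d{5}(?=(\.[A-Za-z0-9]+)?$)")


def strip_partition_suffix(name):
    """Return the name without the trailing partition number, if present."""
    return SUFFIX.sub("", name, count=1)


def plan_renames(names):
    """Map old names to new names, skipping any that would collide."""
    targets = {name: strip_partition_suffix(name) for name in names}
    counts = Counter(targets.values())
    renames = {
        old: new
        for old, new in targets.items()
        if old != new and counts[new] == 1
    }
    skipped = sorted(old for old, new in targets.items() if counts[new] > 1)
    return renames, skipped


# Hypothetical file names, only for illustration.
names = [
    "Sales Date=2021-08-07-00001.parquet",
    "Sales Date=2021-08-09-00001.parquet",
    "Sales Date=2021-08-09-00002.parquet",
]
renames, skipped = plan_renames(names)
print(renames)
print(skipped)
```

Here only the 2021-08-07 file gets a rename planned. The two 2021-08-09 files are left alone because they would both map to the same name.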


Hi @TeoLachev-9086,

Thank you for posting your query on the Microsoft Q&A platform. Did the approach in your comment above work for you? If yes, could you please mark it as the Accepted Answer? Accepting answers helps the community too.

Please let us know if you have any further queries. Thank you.
