Question

Asked by DivakarKrishnan-8551

Wildcard folder design and ETL Log table logging.

Dear Team,
I need your assistance with two real-time scenarios.

Issue 1:
1) In our Azure Data Lake, I have the following data folder structure. My source is CSV files from ADLS and my sink is Azure SQL.

Eltizamraw --- > ManageEngine --- > IT
Eltizamraw --- > ManageEngine --- > Non_IT

I receive files daily in both folders (IT & Non_IT), so I designed my source dataset as below using the wildcard file path. When I execute the pipeline, all files are picked up properly, but in Azure SQL the records are duplicated: the same count is loaded for every file.

Temporarily, I have created two different pipelines, each pointing to one folder individually, to load the data. But I need help loading this data using a single pipeline / single source dataset (see 182874-issue1.png).




Issue 2:
2) Using a ForEach activity, I am loading multiple CSV files into an Azure SQL sink. I have designed my pipeline as below (same Issue1.png) to capture the file name, sink table name, insert count, start time, end time, status, etc.

When I load multiple files into the same table, all files load correctly as expected, but ETL logging does not happen correctly with my current pipeline design. For example, if I load 4 files to Azure SQL, my ETL log table contains the same details four times instead of individual file-load details. Please assist me on how to log the ETL details for each file correctly.


Thanks,
P.K.Divakar


azure-data-factory
issue1.png (87.8 KiB)

1 Answer

Answered by MartinJaffer-MSFT

Hello @DivakarKrishnan-8551,
Thanks for the question and using MS Q&A platform.

So if I understand correctly, you have 2 subfolders (IT, Non_IT) receiving files, and you want to load all files from both subfolders into one Azure SQL table. The problem is that you are seeing duplicate records in SQL.

I notice you have both a parameterized dataset and the wildcard file path option selected in the source. I assume the dataset parameter is for the folder or file name.
I started wondering which would take precedence: the details set by the parameter, or the wildcard file path. So I tried it out, and found that the wildcard file path overruled the parameter.

In hindsight, this makes sense, as the other option is "File path in dataset". That is the path the dataset parameter changes.

Since you are using the wildcard file path option, with * in the file box, everything gets copied. And since you are doing this inside a ForEach loop, everything gets copied once per iteration. This is why you have duplicates.

If you are not missing any records, then I recommend discarding the loop. The wildcard should be enough by itself.

If you have logic in the parameter that you need, I recommend setting the wildcard to "eltizamraw / ManageEngine / * / (empty)".
When the second block in the wildcard is empty, the TargetFileName parameter can take effect. There are two blocks (folder and file) in both the dataset and the wildcard file path; a * in a wildcard block overrides the corresponding dataset block.
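As a rough sketch of that setup (the folder names come from the question; the property layout follows the Copy activity's ADLS Gen2 source settings, so treat the exact shape and names as assumptions), the source side of the Copy activity would look something like:

```json
{
  "source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
      "type": "AzureBlobFSReadSettings",
      "recursive": true,
      "wildcardFolderPath": "eltizamraw/ManageEngine/*",
      "wildcardFileName": "*.csv"
    }
  }
}
```

Per the note above, leaving `wildcardFileName` unset (rather than `*`) would let the dataset's parameterized file name take effect while the folder wildcard still spans both subfolders.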

Please do let me know if you have any queries.

Thanks
Martin



Hello @MartinJaffer-MSFT ,

Yes, that works for Issue 1, thanks a lot.

Solution for Issue 1:
"eltizamraw / ManageEnginge/* / (empty)

Meanwhile, can you please assist with issue #2, which is ETL logging.

Issue 2:
As described in my original question: when loading multiple CSV files into the same table via a ForEach activity, all files load correctly, but the ETL log table records the same details for every file (e.g. 4 files produce 4 identical log rows) instead of individual file-load details.


Thanks,
P.K.Divakar

nasreen-akter commented:

Hi @DivakarKrishnan-8551, you can try GetMetadata Activity --> Child Items in your pipeline. Thanks!
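As a sketch of that suggestion (the activity and dataset names here are hypothetical), a Get Metadata activity requesting `childItems` can feed the ForEach:

```json
{
  "name": "GetFileList",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
    "fieldList": [ "childItems" ]
  }
}
```

The ForEach's Items setting would then be `@activity('GetFileList').output.childItems`, making `item().name` available inside the loop for each file.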


Hi @nasreen-akter,

For issue #2, I am already using Child Items as the argument in the Get Metadata activity.

In the insert-ETL-log (stored procedure) activity inside the ForEach, I am passing the following expression. When I have 5 input files, one file name is repeated 5 times in the ETL log table instead of five different file names; the records themselves, however, are inserted correctly by the sink activity.


@substring(item().name,0,15)

Any suggestion please?

183747-image.png
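One common cause of this symptom is worth checking: pipeline variables in ADF are pipeline-scoped, so if the file name is placed into a variable via a Set Variable activity before the logging call, a parallel ForEach overwrites it on every iteration and all log rows end up with the same value. Referencing `item()` directly in the stored procedure parameter (or marking the ForEach as Sequential) avoids that. A sketch, with hypothetical procedure and parameter names, reusing the expression above:

```json
{
  "name": "InsertEtlLog",
  "type": "SqlServerStoredProcedure",
  "typeProperties": {
    "storedProcedureName": "[dbo].[usp_InsertEtlLog]",
    "storedProcedureParameters": {
      "FileName": { "value": "@{substring(item().name, 0, 15)}", "type": "String" }
    }
  }
}
```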


image.png (47.2 KiB)