Question

chrisw-1001 asked Kira-22 commented

Azure Data Factory: Comma causing unwanted column splitting when copying from Excel to CSV


This might seem like the kind of question that has been asked many times, but I can’t seem to find an answer to my specific problem on this forum or elsewhere. I hope that I’m not missing something obvious.

I am using an ADF (v2) copy activity to convert an Excel file to a (new) CSV file. Some of the column headers in the Excel source file have a single comma. Example: “foo, bar”. When inspecting the output CSV file with a text editor, the column headers are not quoted even though “Quote All Text” is selected on the Sink tab of the copy activity. Data in all other rows are quoted as expected. When I open the CSV file in the Excel UI, each column containing a comma in the header is split into two fields. For example, the (single) column “foo, bar” from the Excel file appears as two separate columns in the CSV: “foo” and “bar”, which is undesired.
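For anyone unfamiliar with why the splitting happens: by CSV convention (RFC 4180), a field containing the delimiter must be quoted; an unquoted comma is always treated as a field separator. A minimal Python illustration of the difference (the header text here is just the example from above):

```python
import csv
import io

# An unquoted header line containing a comma: parsers split it on the comma,
# so one logical column becomes two.
unquoted = "foo, bar\r\n"
rows = list(csv.reader(io.StringIO(unquoted)))
print(rows[0])  # ['foo', ' bar']

# The same header, quoted: it survives as a single field.
quoted = '"foo, bar"\r\n'
rows = list(csv.reader(io.StringIO(quoted)))
print(rows[0])  # ['foo, bar']
```

This is exactly what Excel does when opening the output file, which is why the unquoted header row is the only row that breaks.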

The only potential solution I can think of is to uncheck “First Row as Header” in the Excel dataset so that the copy activity assigns “Prop_0”, “Prop_1” as the headers in the CSV and then create a second CSV from the first one, with “Skip Line Count” set to 1 in the Source settings on the copy activity. I haven’t tested this solution, but it seems inefficient and potentially costly to scale.

Does anyone know of a better solution?

Thanks in advance for any advice.


azure-data-factory

Hello @chrisw-1001 and welcome to Microsoft Q&A. You have put some good thought into this. I have reproduced the issue and am contemplating alternatives.

If you want to skip the headers in the output csv, there is an easier way: Uncheck the "First Row as Header" in the CSV dataset. You can leave "First Row as Header" checked in the Excel dataset.

If you want to include headers in the output csv, you could explicitly specify them in the output csv dataset schema.

I noticed that when I ran a Lookup activity on the Excel file, the header was read correctly. Perhaps this could be used to dynamically construct the copy activity's mapping.

Ideally, CSV files should not contain commas in any values. Have you considered tab-delimited or pipe-delimited output instead of comma-delimited?
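To illustrate the delimiter suggestion: with a tab (or pipe) delimiter, the comma in the header is just an ordinary character and never needs quoting. A quick Python sketch, not ADF code:

```python
import csv
import io

# The first row is the problematic header; the comma is harmless
# once the delimiter is a tab.
rows = [["foo, bar", "baz"], ["1", "2"]]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerows(rows)

print(buf.getvalue())
# foo, bar	baz
# 1	2
```

Reading the file back with the same tab delimiter recovers `foo, bar` as a single column, no quoting required.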

I noticed that when I exported as CSV from the Excel application itself, the header cell containing a comma was quoted (foo,bar -> "foo,bar"). Matching that behavior in ADF could be submitted as a feature request.


Hi @MartinJaffer-MSFT - thank you for taking the time to reproduce the issue and for the suggestions.

To stay true to our architecture, we would like to convert the source Excel file "as is" to CSV (without changing the column headers or delimiters in either file).

Unfortunately, hard-coding the headers won't work, since my datasets specify the file dynamically via a parameter.

I'll plan to investigate the Lookup activity and the feature request.

Thank you again!



1 Answer

MartinJaffer-MSFT answered Kira-22 commented

@chrisw-1001 there is one more solution, not sure why I didn't think of it before.

In both the source Excel and sink CSV datasets, disable "First Row as Header". In the copy activity, leave the mapping empty, and leave the CSV dataset schema empty.

Since the first row is not treated as a header on either side, it is treated as data. And since "Quote All Text" applies to data, the first row, comma and all, gets quoted correctly.
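The principle here can be mimicked outside ADF. Python's csv writer with QUOTE_ALL quotes every field it writes, so if the header row is handed to the writer as an ordinary data row, it gets quoted like everything else. A small sketch, just an illustration of the mechanism, not ADF code:

```python
import csv
import io

# Rows as they might come from the Excel source. The first row is
# conceptually the header, but we pass it through as plain data.
rows = [["foo, bar", "baz"], ["1", "2"]]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

print(buf.getvalue())
# "foo, bar","baz"
# "1","2"
```

Because the quoting rule only sees data rows, demoting the header to data is what makes it eligible for quoting.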

Before you worry about there not being headers, recall that visually, there is nothing to distinguish header from text. Headers are an artifact of intent in reading/writing data.


That worked! I might need to create another dataset for the solution to work end-to-end with my pipeline, but the overhead is minimal.

Thanks for taking additional time with my question.


Thank you for letting me know. Feedback is always appreciated!

Kira-22 replied to MartinJaffer-MSFT ·

Hi,

I had the same issue, and your solution worked. However, in my case, when the CSV files are copied to Azure Storage, they end up in TXT format. Is there any reason for that? I would like to keep the original CSV format.


Thank you
