question

AMJ-2211 avatar image
0 Votes"
AMJ-2211 asked AMJ-2211 answered

Remove newlines inside a text field of dynamic csv columns

As mentioned in the title there are newlines ({CRLF}) embedded in text fields of csv files such as

,,,,,"this is text field {CRLF} which finishes here",,,,,

Normally you will use a function like: regexReplace(<Column name>, ,"([\r\n|\r|\n]+)",, ' ') https://docs.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions to replace {CRLF} with a space. The issue is the <Column name> is dynamic. Here is the simplified extract of a csv file:


197119-image.png


As one can see the text field denoted by a pair of double quotes has span multiple lines as the result of the embedded {CRLF} within the text field. The dynamic <Column name> means you don't know which column in a csv file is a text field. Also there would be multiple csv files through a single pipeline. The schema in each csv file is different. Please chip in your thoughts.


azure-data-factoryazure-databricks
image.png (20.4 KiB)
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@AMJ-2211 : Can you let me know how the CSV is generated, is the CSV is generated from a database as source via query or any other form?

0 Votes 0 ·

@PratikSomaiya The csv was generated via a API call and stored to a data lake as below:

197156-image.png


0 Votes 0 ·
image.png (45.4 KiB)
PratikSomaiya avatar image
0 Votes"
PratikSomaiya answered AMJ-2211 commented

@AMJ-2211

Can you apply the Replace function in your API query body itself? Are you using a Web Activity for API call or any via other way?

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@PratikSomaiya Please see below:

197187-image.png



This was a HTTP call with binary format. The request body was empty. Could you supply a sample code regarding to using the function regexReplace() in this body? Bear in mind this would be a binary stream at the point of time before a csv file is formed.

0 Votes 0 ·
image.png (31.1 KiB)
ShaikMaheer-MSFT avatar image
0 Votes"
ShaikMaheer-MSFT answered

Hi @AMJ-2211 ,

Thank you for posting query in Microsoft Q&A Platform.

If possible try to see if you can apply your replace logic on response itself and then storing it as csv file. That way it will reduce lot of efforts.

If you would like to play around the logic on csv file after loading data from API then we should consider using data flows. Firstly, try to read entire file data as single column. For the you should have your dataset settings as below.
198161-image.png

Once you get entire file data as single column then apply your replace logic using derived column. and then consider breaking your single column single data as multiple items using split() function on space. But if your any column has space in there data then that will also get splitted. Hence we should consider omit splitting of data which is wrapped insider quotes.

Once we splitted data as array then flatten that data as multiple rows and load as file. and then use this newly created file as source for further processes.
198097-image.png

All above mentioned logic handling may be difficult and sometimes dataflows not much flexible to have this logic. So best way is having your custom code written in Azure functions to read your file data and do the replace() accordingly.

Hope this helps. Please let us know how it goes. Thank you.


Please consider hitting Accept Answer. Accepted answers help community as well.


image.png (104.1 KiB)
image.png (230.6 KiB)
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

AMJ-2211 avatar image
0 Votes"
AMJ-2211 answered

@ShaikMaheer-MSFT Thank you so much for your thoughts. Unfortunately it won't work. I modified your sample data to closely imitate the real situation as below:

198165-image.png

As you could see there are 2 issues applying your method to a read data:

  • There are spaces within a column which is not surrounded by a pair of double quotes. They are legitimate in one column;

  • There exist double quotes within the double quotes of the text delimiter




image.png (13.5 KiB)
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.