Asked by IR-1518

Data Factory Dataflow with Snowflake warehouse - Schema drift

Hi,

We are doing a POC to use ADF mapping data flows to load data into Snowflake warehouse tables. We tried to create a generic data flow that reads from one Snowflake table and writes to another Snowflake table with the schema drift option enabled.
Source table: 10 columns (5 column names match the target)
Target table: 5 columns

But the data flow fails on a source column that is not present in the target table, with this error:
"Message":"Job failed due to reason: at Sink 'SinkHub': java.sql.BatchUpdateException: SQL compilation error: error line 1 at position 3\ninvalid identifier 'XYZ'"

The data flow succeeds only if the source table has exactly the same column names as the target table. We expected the matching columns to be populated when auto mapping is enabled.
That suggests the data flow schema drift option is not working with the Snowflake database.

Has anyone faced a similar issue?

Thanks

azure-data-factory

1 Answer

Answered by ShaikMaheer-MSFT

Hi @IR-1518 ,

Welcome to Microsoft Q&A Platform. Thank you for posting your query here.

In data flows, auto mapping requires the column names to match exactly between source and sink.

If your column names do not match between source and sink, you should disable auto mapping in the Sink transformation's Mapping tab and map the columns manually.
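For illustration, a fixed manual mapping shows up in the underlying data flow script roughly like the sketch below. This is a hand-written sketch, not the script from this thread: the incoming stream name SourceHub, the sink name SinkHub, and the column names are placeholders, and only the columns that exist in the target table are listed, so any extra source columns are simply left unmapped.

    SourceHub sink(allowSchemaDrift: true,
        validateSchema: false,
        deletable: false,
        insertable: true,
        updateable: false,
        upsertable: false,
        mapColumn(
            CUSTOMER_ID,
            CUSTOMER_NAME,
            COUNTRY,
            CREATED_AT,
            STATUS
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> SinkHub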

Schema drift is the case where your sources often change their metadata.

In a source transformation, schema drift is defined as reading columns that aren't defined in your dataset schema. When schema drift is enabled, all incoming fields are read from your source during execution and passed through the entire flow to the sink. By default, all newly detected columns, known as drifted columns, arrive as a string data type.

In a sink transformation, schema drift is when you write additional columns on top of what is defined in the sink data schema.

So, schema drift covers the case where you receive additional columns, or read columns that are not defined in your dataset schema. It is not a mechanism for mapping columns between source and sink when the column names don't match.
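A related note: because drifted columns are not part of the projection, you reference them downstream with the byName() expression, typically in a derived column that also casts them from the default string type. A minimal sketch with hypothetical stream and column names:

    SourceHub derive(SALE_AMOUNT = toDecimal(byName('SALE_AMOUNT')),
        SALE_DATE = toDate(byName('SALE_DATE'))) ~> TypedDriftedColumns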

You can also refer to this short video to learn more about schema drift in mapping data flows:
https://www.youtube.com/watch?v=LS0u7DxhpDI

The schema drift documentation covers this in more detail.

Hope this helps. Please let us know if you have any further queries. Thank you.


Comments

Hi @ShaikMaheer-MSFT,

Thanks for responding.

Yes, my source and sink column names match exactly; it's just that the source has more columns than the target.
My expectation is that the matching columns between source and target should be populated, but that is not happening.

I have tested the data flow with more columns in the sink than in the source using schema drift, and it runs fine, but the other way around fails (more columns in the source than in the sink).

Please see if you can run a test with more columns in the source and fewer in the sink, with schema drift enabled.

Thanks


Hi @IR-1518 ,

I tested with more columns in the source than in the sink. It works fine for me with auto mapping (make sure the column names in source and sink are the same).

Please check the screenshots below: my source dataset has 3 columns and the source transformation projection also shows 3 columns, but the source data preview shows 4 columns.
The sink table has only 3 columns, and the sink transformation has auto mapping enabled.

Source dataset: (3 columns)

126975-image.png

Source Transformation Projection: (3 columns)

126976-image.png

Source Transformation Data preview: (4 columns)

127036-image.png

Sink Table: (3 columns)

126977-image.png

Sink Transformation mappings: (Auto mapping)
127071-image.png

Could you please check and make sure you have schema drift enabled in both the source and sink transformations? Could you also confirm that your column names match exactly between source and sink? And do your datasets have a schema imported?

Please let me know if I am missing something.

Also, if this is helpful, please consider accepting the answer; accepted answers help the community as well. Please let us know if you have any further queries. Thank you.


Hi @ShaikMaheer-MSFT


Thank you so much. Really appreciate your efforts in resolving this issue.
As I mentioned earlier, I am creating a generic data flow without any schema in the source and sink datasets, so that we can reuse the same data flow for any source and target table with the same ETL.
Importing the schema in the dataset might work, since ADF would already know what to get from the source, but that is not the point of schema drift (late binding).

With schema drift, the data flow shouldn't fail even if columns are added to or removed from the source in the future. Correct me if this understanding is wrong.

Please try the same scenario without importing the schema in the datasets. That might work for you with a file as source and a SQL DB as sink, but it is not working with a Snowflake table as source and sink.

Again, thank you so much for your time and contribution.


Thanks


Hi @IR-1518 ,

I tried without importing the schema in the datasets (source and sink), with "Allow schema drift" enabled in both the Source and Sink transformations. It works perfectly fine; the data for all matching columns is populated in the sink table.

Please check the images below.

Source Transformation:
127584-sourcetransformation.gif

Sink Transformation:
127585-sinktransformation.gif
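For reference, a setup like this (no schema imported, schema drift allowed at both ends, auto mapping on) corresponds to a data flow script roughly like the sketch below. The stream names are illustrative; the point is that the source has no output() projection and the sink has no mapColumn() block.

    source(allowSchemaDrift: true,
        validateSchema: false) ~> SourceHub
    SourceHub sink(allowSchemaDrift: true,
        validateSchema: false,
        deletable: false,
        insertable: true,
        updateable: false,
        upsertable: false,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> SinkHub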


Could you please re-check whether the Allow schema drift option is enabled in both the Source and Sink transformations? Per the documentation, no such limitation is mentioned for the Snowflake connector, so I'm curious to reconfirm this.

Unfortunately, I don't have a Snowflake setup. Once you confirm, I will escalate this internally to see whether there is any known limitation in the Snowflake connector.


Hi Maheer,

Yes, I have enabled the schema drift and auto mapping options on both the source and sink transformations.

Can you please check with the product team about this?

Thanks


Because you are writing to a defined schema with the Snowflake sink, ADF must match the target column names. If you are modifying the schema, you will need to either write to a new table or use the Recreate Table option in the Sink. Alternatively, you can write the output to a Parquet-format ADLS Gen2 sink and then load that data into your Snowflake destination with a separate pipeline activity after your data flow transformation step.
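If you try the Recreate Table route, the table action is set on the Sink transformation's settings; in the generated data flow script it typically shows up as an extra property on the sink, roughly as sketched below. The recreate property name and the stream names here are written from memory as an illustration, so confirm against the script your own data flow generates.

    SourceHub sink(allowSchemaDrift: true,
        validateSchema: false,
        deletable: false,
        insertable: true,
        updateable: false,
        upsertable: false,
        recreate: true,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> SinkHub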
