Mapping parquet format datasets to Azure Datawarehouse

Anmol Ganju 166 Reputation points
2022-01-18T15:45:39.47+00:00

Hi, I have a pipeline that copies all the files (*.parquet) present in different folders into different tables using foreach loop, Now I have particular set of columns defined for each destination tables, for example Table A contains 100 columns, while the parquet file which is as a source contains 200 columns, I want to override the copy functionality so that whatever columns which are there in destination Table A gets copied from these parquet format files, for 1 table this can be done by importing the mappings, but how can this be done if we have 20 tables and for each one of them I want to communicate ADF to only copy columns which are present in destination and not copy everthing (because copy everything will surely give me an error of columns not present in the destination)

How can this be achieved any suggestions?

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,611 questions
0 comments No comments
{count} votes

Accepted answer
  1. HimanshuSinha-msft 19,381 Reputation points Microsoft Employee
    2022-01-20T02:15:52.69+00:00

    <<Continued from the last reply >>
    I am adding the gif , which should help you ,
    166585-paraquet2.gif

    =======================================================================

    166543-paraquet3.gif

    Please do let me know how it goes .
    Thanks
    Himanshu

    -------------------------------------------------------------------------------------------------------------------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators

1 additional answer

Sort by: Most helpful
  1. HimanshuSinha-msft 19,381 Reputation points Microsoft Employee
    2022-01-20T01:44:24.55+00:00

    Hello @Anmol Ganju ,
    Thanks for the ask and using Microsoft Q&A platform .

    As I understand the goal here is to load a few columns from the paraquet file to SQLDW .

    At this time when we have a source as paraquet file , there is no way we can have a query option in which you only select few rows . We can achieve something similar by altering the mapping , but then since you have 20 table i think we can do the below implamentation . I have tested that and it should work .

    Steps

    1. Lets have all the records from paraquet file flow to SQLDW to some _stg ( staging table ) . I think you have already implemented this . In my case the dbo.flightData_stg column has 3 columns but the table dbo.flightData will have only one column .
    2. Lets use a Foreach activity and pass the table names like
      166601-image.png
    3. We will use a query in the SQL like

    select STRING_AGG(name,',') as columnName from sys.syscolumns
    where id =object_id('dbo.flightData');

    to get all the columns for the table in the where clause

    166593-image.png

    We will use a dynamic expression like ( i will upload the dynamic expression as a text file )

    @markus.bohland@hotmail.de ('SELECT STRING_AGG(name,''', ',',''') as columnName from sys.syscolumns where id =object_id(''', first(split(item(),'_')) ,'''' ,')' )

    In my implementation I have used a variable name to contruct the expression , i find this very helpful for debugging

    1. Pull up a Lookup activity and pass on the dynamic expression . The Lookup will return the column name of the actual table ( not _stg table )
    2. We will use a set variable for constructing the query ( with few columns ) the We will use a dynamic expression where we got the column name ( of the target table in our case flightData ) . The aim is to construct a dynamic query like . This is only used for debuging ( if you want we can remove this activity )

    SELECT col1,col2 FROM flightData_stg

    @markus.bohland@hotmail.de ( 'SELECT ' , activity('ColumnNamefromDB').output.firstRow.columnName , ' FROM ' , item())

    6 . Add a copy activity and we will use paramterized data , I think you may already used that in the initla implemenation .

    Source Query

    @markus.bohland@hotmail.de ( 'SELECT ' , activity('ColumnNamefromDB').output.firstRow.columnName , ' FROM ' , item())

    on the destination ( tables name are parameterized )
    Please use the dynamic expression

    @last lion (split(first(split(item(),'_')),'.'))
    The above expression will derieve the actual table name dbo.flightData from the staging table dbo.flightData_stg which we have a passed as an array ( image 1 on this reply )
    166603-dynamic-expression.txt
    <<Continued below >>.

    Thanks
    Himanshu


    1. Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    2. Want a reminder to come back and check responses? Here is how to subscribe to a notification
    3. If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    0 comments No comments