Exercise - Use Azure Data Factory wrangling data

Completed

The Power Query feature within Azure Data Factory enables you to work with and wrangle data. It is an object that can be added to the canvas designer as an activity in an Azure Data Factory pipeline to perform code free data preparation. It enables individuals who are not conversant with the traditional data preparation technologies such as Spark or SQL Server, and languages such as Python and T-SQL to prepare data at cloud scale iteratively.

The Power Query feature uses a grid type interface for basic data preparation that is like the aesthetics of Excel, known as an Online Mashup Editor. The editor also enables more advanced users to perform more complex data preparation using formulas. You first have to create a linked service to a source of the data first before you are able to access the data

Online Mashup Editor

The formulas work with Power Query Online and makes Power Query M functions available for data factory users. Power Query then translates the M language generated by the Online Mashup Editor into spark code for cloud scale execution.

This capability enables both data engineers and data analysts to interactively explore and prepare datasets. In addition, they can interactively work with the M language and preview the result before viewing it in the context of a wider pipeline.

To add a Power Query activity in Azure Data Factory, click the plus icon and select Power Query in the factory resources pane.

Create a Power Query activity

Add a Source dataset for your wrangling data flow, and select a sink dataset. The following data sources are supported.

Connector Data format Authentication type
Azure Blob Storage CSV, Parquet Account Key
Azure Data Lake Storage Gen1 CSV Service Principal
Azure Data Lake Storage Gen2 CSV, Parquet Account Key, Service Principal
Azure SQL Database SQL authentication
Azure Synapse Analytics SQL authentication

Once you have selected a source, then click on create.

Adding source datasets to wrangling data flow

This opens the Online Mashup Editor.

Navigating the wrangling data flow

It consists of the following components:

  1. Dataset list.

    This will provide the datasets that have been defined as the source for the Data Wrangling.

  2. Wrangling Function toolbar.

    The toolbar contains a variety of data wrangling functions that the user can access to manipulate the data including:

    • Managing columns.
    • Transforming tables.
    • Reducing rows.
    • Adding columns.
    • Combining tables.

    Each item is context-sensitive and contains sub functions specific to it.

  3. Column headings.

    As well as having the ability to rename columns, right-clicking the column will bring up context-sensitive items for managing columns.

  4. Settings.

    This enables you to add or edit data sources and data sinks, and modify setting for the wrangling data task.

  5. Steps window.

    This window shows the steps that have been applied to the wrangling output. In the example in the graphic, the step named “Source” has been applied the wrangling output named “UserQuery”.

  6. Power Query output list.

    Lists the data wrangling output that has been defined.

  7. Publish button.

    Enables you to publish the work that has been created.

A Power Query task can be added in the canvass designer just like a Copy Activity task, or a Mapping Data Flow task and can be managed and monitored in the same way.

completing the wrangling data flow