Azure Purview Data Catalog lineage user guide
This article provides an overview of the data lineage features in Azure Purview Data Catalog.
One of the platform features of Azure Purview is the ability to show the lineage between datasets created by data processes. Systems like Data Factory, Data Share, and Power BI capture the lineage of data as it moves. Custom lineage reporting is also supported through Atlas hooks and the REST API.
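As a rough illustration of custom lineage reporting, the following sketch builds the JSON body of an Apache Atlas v2 style entity request that declares a process with one input and one output dataset. The helpers `dataset_ref` and `build_process_entity`, the qualified names, and the choice of the generic `Process` and `DataSet` types are assumptions made for this example. The endpoint and authentication for a particular Purview account are not covered, so the code only constructs and prints the payload.

```python
import json


def dataset_ref(type_name: str, qualified_name: str) -> dict:
    """Reference an existing dataset entity by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs) -> dict:
    """Build an Atlas-style entity payload for a lineage process.

    Assumption: the process uses the generic Atlas "Process" type, whose
    "inputs" and "outputs" attributes hold references to datasets.
    """
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


# Hypothetical dataset and process names, for illustration only.
payload = build_process_entity(
    name="nightly_copy",
    qualified_name="custom://etl/nightly_copy",
    inputs=[dataset_ref("DataSet", "custom://raw/orders.csv")],
    outputs=[dataset_ref("DataSet", "custom://curated/orders")],
)

print(json.dumps(payload, indent=2))
```

Referencing datasets by qualified name rather than by GUID keeps the sketch independent of any particular catalog instance.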
Metadata collected in Azure Purview from enterprise data systems is stitched together to show end-to-end data lineage. Data systems that report lineage into Purview fall broadly into the following three types.
Data processing systems
Data integration and ETL tools can push lineage into Azure Purview at execution time. Tools such as Data Factory, Data Share, Synapse, and Azure Databricks belong to this category. Data processing systems reference datasets from different databases and storage solutions as sources to create target datasets. The data processing systems currently integrated with Purview for lineage are listed in the following table.
|Data processing system||Supported scope|
|Azure Data Factory||Copy activity, Data flow activity, Execute SSIS package activity|
|Azure Data Share||Share snapshot|
Data storage systems
Databases and storage solutions such as SQL Server, Teradata, and SAP have query engines that transform data using a scripting language. Data lineage from stored procedures is collected into Purview and stitched with lineage from other systems.
|Data storage system||Supported scope|
Data analytics & reporting systems
Data systems like Azure ML and Power BI report lineage into Azure Purview. These systems use datasets from storage systems and process them through their meta model to create BI dashboards, ML experiments, and so on.
|Data analytics & reporting system||Supported scope|
|Power BI||Datasets, Dataflows, Reports & Dashboards|
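The stitching described in this section can be pictured as merging the lineage edges reported by different systems wherever they refer to the same dataset. The following sketch is a simplified model, not Purview's implementation: the `stitch` and `end_to_end` helpers and all dataset and process names are made up, and datasets are matched by a plain string identifier.

```python
from collections import defaultdict

# Each system reports (source dataset, process, target dataset) triples.
# All names below are hypothetical.
reports = {
    "Data Factory": [
        ("sql://sales/orders", "Copy activity", "blob://lake/orders.csv"),
    ],
    "Power BI": [
        ("blob://lake/orders.csv", "Dataflow", "powerbi://sales/dataset"),
    ],
}


def stitch(reports_by_system):
    """Merge per-system edges into one graph keyed by source dataset."""
    graph = defaultdict(list)
    for system, edges in reports_by_system.items():
        for source, process, target in edges:
            graph[source].append((process, target, system))
    return graph


def end_to_end(graph, start):
    """Follow edges from `start` and return the datasets reached, in order."""
    path, frontier, seen = [start], [start], {start}
    while frontier:
        current = frontier.pop(0)
        for _process, target, _system in graph.get(current, []):
            if target not in seen:
                seen.add(target)
                path.append(target)
                frontier.append(target)
    return path


graph = stitch(reports)
lineage = end_to_end(graph, "sql://sales/orders")
print(" -> ".join(lineage))
```

Because both systems name the same intermediate file, the two separately reported edges join into a single path from the SQL table to the Power BI dataset.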
Get started with lineage
Lineage in Purview includes datasets and processes. Datasets are also referred to as nodes, and processes are also called edges:
Dataset (Node): A dataset (structured or unstructured) provided as an input to a process. For example, a SQL table, an Azure blob, and files (such as .csv and .xml) are all considered datasets. In the lineage section of Purview, datasets are represented by rectangular boxes.
Process (Edge): An activity or transformation performed on a dataset is called a process. Examples include an ADF Copy activity and a Data Share snapshot. In the lineage section of Purview, processes are represented by round-edged boxes.
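The node and edge terminology above can be sketched as a small data model. This is only an illustration of the concept: the `Dataset` and `Process` classes, the `upstream` helper, and the asset names are hypothetical and do not reflect Purview's internal types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A node in the lineage graph, such as a SQL table or a .csv file."""
    name: str


@dataclass
class Process:
    """An edge: an activity that reads input datasets and writes outputs."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def upstream(target, processes):
    """Return the datasets that directly feed `target` through any process."""
    found = []
    for process in processes:
        if target in process.outputs:
            found.extend(process.inputs)
    return found


# Hypothetical assets for illustration.
table = Dataset("SQL table: dbo.Orders")
blob = Dataset("Azure blob: orders.csv")
copy = Process("ADF Copy activity", inputs=[table], outputs=[blob])

print([d.name for d in upstream(blob, [copy])])
```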
To access lineage information for an asset in Purview, follow these steps:
In the Azure portal, go to the Azure Purview accounts page.
Select your Azure Purview account from the list, and then select Launch purview account from the Overview page.
On the Azure Purview Home page, search for a dataset name or a process name, such as ADF Copy or Data Flow activity, and then press Enter.
From the search results, select the asset and select its Lineage tab.
Azure Purview supports asset-level lineage for datasets and processes. To see the asset-level lineage, go to the Lineage tab of the current asset in the catalog. Select the current dataset asset node. By default, the list of columns belonging to the dataset appears in the left pane.
Dataset column lineage
To see column-level lineage of a dataset, go to the Lineage tab of the current asset in the catalog and follow these steps:
Once you are in the lineage tab, in the left pane, select the check box next to each column you want to display in the data lineage.
Hover over a selected column on the left pane or in the dataset of the lineage canvas to see the column mapping. All the column instances are highlighted.
If the number of columns is larger than what can be displayed in the left pane, use the filter option to select a specific column by name. Alternatively, you can use your mouse to scroll through the list.
If the lineage canvas contains many nodes and edges, use the filter to select data asset or process nodes by name. Alternatively, you can use your mouse to pan around the lineage window.
Use the toggle in the left pane to highlight the list of datasets in the lineage canvas. If you turn off the toggle, any asset that contains at least one of the selected columns is displayed. If you turn on the toggle, only datasets that contain all of the columns are displayed.
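The two toggle states amount to an "any" versus "all" match on the selected columns. The following sketch models that behavior under the assumption that each dataset is just a name with a list of column names. The `visible_datasets` function and the sample data are hypothetical.

```python
def visible_datasets(datasets, selected_columns, require_all):
    """Filter datasets by the selected columns.

    require_all=False mirrors the toggle being off (at least one match);
    require_all=True mirrors the toggle being on (every column must match).
    """
    selected = set(selected_columns)
    visible = []
    for name, columns in datasets.items():
        present = selected & set(columns)
        if require_all:
            keep = present == selected
        else:
            keep = len(present) > 0
        if keep:
            visible.append(name)
    return visible


# Hypothetical datasets and their columns.
datasets = {
    "orders_raw": ["order_id", "customer_id", "amount"],
    "orders_clean": ["order_id", "amount"],
    "customers": ["customer_id", "name"],
}
selected = ["order_id", "customer_id"]

print(visible_datasets(datasets, selected, require_all=False))
print(visible_datasets(datasets, selected, require_all=True))
```

With the toggle off, all three sample datasets qualify because each contains at least one selected column. With it on, only `orders_raw` contains both.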
Process column lineage
A data process can take one or more input datasets to produce one or more outputs. In Purview, column-level lineage is available for process nodes.
Switch between input and output datasets from the drop-down in the columns panel.
Select columns from one or more tables to see the lineage flowing from input dataset to corresponding output dataset.
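Column lineage through a process can be thought of as a set of mappings from input columns to output columns. The following sketch shows how selecting input columns picks out the corresponding output columns. The mapping structure, the `trace_columns` helper, and the table and column names are assumptions for illustration only.

```python
# Hypothetical column mappings for one process: each entry links an
# (input dataset, input column) pair to an (output dataset, output column) pair.
column_mappings = [
    (("orders", "order_id"), ("orders_summary", "order_id")),
    (("orders", "amount"), ("orders_summary", "total_amount")),
    (("customers", "name"), ("orders_summary", "customer_name")),
]


def trace_columns(mappings, selected):
    """Return the output columns fed by the selected input columns."""
    chosen = set(selected)
    return [target for source, target in mappings if source in chosen]


# Select columns from two different input tables.
selected_inputs = [("orders", "amount"), ("customers", "name")]
for dataset, column in trace_columns(column_mappings, selected_inputs):
    print(f"{dataset}.{column}")
```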
Browse assets in lineage
Select Switch to asset on any asset to view its corresponding metadata from the lineage view. Doing so is an effective way to browse to another asset in the catalog from the lineage view.
The lineage canvas can become complex for popular datasets. To avoid clutter, the default view shows only five levels of lineage for the asset in focus. The rest of the lineage can be expanded by selecting the bubbles in the lineage canvas. Data consumers can also hide assets in the canvas that are of no interest. To further reduce clutter, turn off the More Lineage toggle at the top of the lineage canvas. This action hides all the bubbles in the lineage canvas.
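One way to picture the five-level default is a breadth-first walk from the asset in focus that stops after a fixed number of hops. The following sketch assumes that a "level" is one hop along an edge in either direction, which is a simplification. The `visible_nodes` function and the sample chain of datasets are hypothetical.

```python
from collections import deque


def visible_nodes(edges, focus, max_levels=5):
    """Breadth-first walk in both directions, stopping after max_levels hops.

    Returns a dict mapping each visible node to its distance from `focus`.
    """
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)

    depth = {focus: 0}
    queue = deque([focus])
    while queue:
        node = queue.popleft()
        if depth[node] == max_levels:
            continue  # nodes beyond this point stay collapsed
        for nxt in neighbors.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# A hypothetical chain of eight datasets: d0 -> d1 -> ... -> d7.
edges = [(f"d{i}", f"d{i + 1}") for i in range(7)]
shown = visible_nodes(edges, focus="d0")
print(sorted(shown))
```

In this chain, `d6` and `d7` lie more than five hops from `d0`, so they are left out of the result, much like the collapsed portion of the canvas that can be expanded on demand.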
Use the smart buttons in the lineage canvas to get an optimal view of the lineage. Auto layout, Zoom to fit, Zoom in/out, Full screen, and navigation map are available for an immersive lineage experience in the catalog.