Connect to Azure Data Lake Gen2 in Azure Purview

This article outlines the process to register an Azure Data Lake Storage Gen2 data source in Azure Purview, including instructions to authenticate and interact with the Azure Data Lake Storage Gen2 source.

Supported capabilities

| Metadata Extraction | Full Scan | Incremental Scan | Scoped Scan | Classification | Access Policy | Lineage |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | Yes | Yes | Yes | Yes | Yes | Limited** |

** Lineage is supported if the dataset is used as a source or sink in a Data Factory Copy activity.

Prerequisites

Register

This section will enable you to register the ADLS Gen2 data source and set up an appropriate authentication mechanism to ensure successful scanning of the data source.

Steps to register

It is important to register the data source in Azure Purview prior to setting up a scan for the data source.

  1. Go to the Azure portal, navigate to the Purview accounts page, and select your Purview account

    Screenshot that shows the Purview account used to register the data source

  2. Open Purview Studio and navigate to the Data Map --> Sources

    Screenshot that shows the link to open Purview Studio

    Screenshot that navigates to the Sources link in the Data Map

  3. Create the Collection hierarchy using the Collections menu and assign permissions to individual subcollections, as required

    Screenshot that shows the collection menu to create collection hierarchy

  4. Navigate to the appropriate collection under the Sources menu and select the Register icon to register a new ADLS Gen2 data source

    Screenshot that shows the collection used to register the data source

  5. Select the Azure Data Lake Storage Gen2 data source and select Continue

    Screenshot that allows selection of the data source

  6. Provide a suitable Name for the data source, select the relevant Azure subscription, the existing Data Lake Store account name, and the collection, and then select Apply

    Screenshot that shows the details to be entered in order to register the data source

  7. The ADLS Gen2 storage account will be shown under the selected Collection

    Screenshot that shows the data source mapped to the collection to initiate scanning

Scan

Prerequisites for scan

To give Azure Purview access to scan the data source, you need to configure an authentication method in the ADLS Gen2 storage account. The following options are supported:

Note

If the firewall is enabled for the storage account, you must use the managed identity authentication method when setting up a scan.

  • System-assigned managed identity (Recommended) - As soon as the Azure Purview account is created, a system-assigned managed identity (SAMI) is created automatically in the Azure AD tenant. Depending on the type of resource, specific RBAC role assignments are required for the SAMI to perform the scans.

  • User-assigned managed identity (preview) - Similar to a system-assigned managed identity, a user-assigned managed identity (UAMI) is a credential resource that can be used to allow Azure Purview to authenticate against Azure Active Directory. For more information, you can see our User-assigned managed identity guide.

  • Account Key - Secrets can be created inside an Azure Key Vault to store credentials in order to enable access for Azure Purview to scan data sources securely using the secrets. A secret can be a storage account key, SQL login password, or a password.

    Note

    If you use this option, you need to deploy an Azure Key Vault resource in your subscription and grant the Azure Purview account’s SAMI the required access permission to secrets inside the key vault.

  • Service Principal - In this method, you can create a new service principal or use an existing one in your Azure Active Directory tenant.

Authentication for a scan

Using a system or user assigned managed identity for scanning

It is important to give your Purview account or user-assigned managed identity (UAMI) the permission to scan the ADLS Gen2 data source. You can add your Purview account's system-assigned managed identity (which has the same name as your Purview account) or UAMI at the Subscription, Resource Group, or Resource level, depending on what level scan permissions are needed.

Note

You need to be an owner of the subscription to be able to add a managed identity on an Azure resource.

  1. From the Azure portal, find either the subscription, resource group, or resource (for example, an Azure Data Lake Storage Gen2 storage account) that you would like to allow the catalog to scan.

    Screenshot that shows the storage account

  2. Select Access Control (IAM) in the left navigation and then select + Add --> Add role assignment

    Screenshot that shows the access control for the storage account

  3. Set the Role to Storage Blob Data Reader and enter your Azure Purview account name or user-assigned managed identity under the Select input box. Then, select Save to give this role assignment to your Purview account.

    Screenshot that shows the details to assign permissions for the Purview account
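The role assignment in the steps above can also be scripted. The following is a minimal sketch using the Az PowerShell module, not an official procedure. The resource group, storage account, and Purview account names are placeholders, and the sketch assumes that the system-assigned managed identity can be looked up in Azure AD by the Purview account name (as described above) and that this name matches exactly one identity.

```powershell
# Placeholder values (assumptions): replace with your own names.
$resourceGroupName  = "<ResourceGroupName>"
$storageAccountName = "<StorageAccountName>"
$purviewAccountName = "<PurviewAccountName>"

# Look up the storage account so its resource ID can be used as the scope.
$storageAccount = Get-AzStorageAccount -ResourceGroupName $resourceGroupName -Name $storageAccountName

# The system-assigned managed identity has the same name as the Purview account.
$identity = Get-AzADServicePrincipal -DisplayName $purviewAccountName

# Assign Storage Blob Data Reader at the storage account (resource) level.
$roleAssignment = @{
    ObjectId           = $identity.Id
    RoleDefinitionName = "Storage Blob Data Reader"
    Scope              = $storageAccount.Id
}
New-AzRoleAssignment @roleAssignment
```

To grant the role at the resource group or subscription level instead, pass the resource ID of that resource group or subscription as the Scope value.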

Note

For more details, see the steps in Authorize access to blobs and queues using Azure Active Directory.

Note

If the firewall is enabled for the storage account, you must use the managed identity authentication method when setting up a scan. To allow the scan through the firewall, do the following:

  1. Go into your ADLS Gen2 storage account in Azure portal

  2. Navigate to Security + networking > Networking

    Screenshot that shows the details to provide firewall access

  3. Choose Selected Networks under Allow access from

    Screenshot that shows the details to allow access to selected networks

  4. In the Exceptions section, select Allow trusted Microsoft services to access this storage account and select Save

    Screenshot that shows the exceptions to allow trusted Microsoft services to access the storage account
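The firewall settings above can be sketched with the Az PowerShell module as follows. This is an illustration under assumptions rather than the documented procedure: the resource names are placeholders, `-DefaultAction Deny` is used as the equivalent of choosing Selected Networks, and `-Bypass AzureServices` as the equivalent of the trusted Microsoft services exception.

```powershell
# Placeholder values (assumptions): replace with your own names.
$resourceGroupName  = "<ResourceGroupName>"
$storageAccountName = "<StorageAccountName>"

# Deny traffic by default (selected networks only) and let trusted
# Microsoft services bypass the firewall.
$networkRules = @{
    ResourceGroupName = $resourceGroupName
    Name              = $storageAccountName
    DefaultAction     = "Deny"
    Bypass            = "AzureServices"
}
Update-AzStorageAccountNetworkRuleSet @networkRules
```

The Bypass value replaces the existing setting, so if logging or metrics exceptions are already enabled on the account, include them in the value (for example, "AzureServices,Logging,Metrics") to keep them.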

Using Account Key for scanning

When the selected authentication method is Account Key, you need to get your access key and store it in the key vault:

  1. Navigate to your ADLS Gen2 storage account

  2. Select Security + networking > Access keys

    Screenshot that shows the access keys in the storage account

  3. Copy your key and save it separately for the next steps

    Screenshot that shows the access keys to be copied

  4. Navigate to your key vault

    Screenshot that shows the key vault

  5. Select Settings > Secrets and select + Generate/Import

    Screenshot that shows the key vault option to generate a secret

  6. Enter a Name for the secret and, as the Value, enter the key from your storage account

    Screenshot that shows the key vault option to enter the secret values

  7. Select Create to complete

    Screenshot that shows the key vault option to create a secret

  8. If your key vault is not connected to Purview yet, you will need to create a new key vault connection

  9. Finally, create a new credential using the key to set up your scan
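Steps 1 through 7 above can be approximated in one script so that the key never has to be copied by hand. The following is a minimal sketch using the Az PowerShell module. The resource, vault, and secret names are placeholders, the choice of the first access key is an assumption, and the signed-in user is assumed to have permission to set secrets in the key vault. Connecting the key vault to Purview and creating the credential (steps 8 and 9) are not covered.

```powershell
# Placeholder values (assumptions): replace with your own names.
$resourceGroupName  = "<ResourceGroupName>"
$storageAccountName = "<StorageAccountName>"
$keyVaultName       = "<KeyVaultName>"
$secretName         = "<SecretName>"

# Read the storage account access keys and take the first one.
$keys = Get-AzStorageAccountKey -ResourceGroupName $resourceGroupName -Name $storageAccountName

# Key Vault secrets are passed as a SecureString.
$secretValue = ConvertTo-SecureString -String $keys[0].Value -AsPlainText -Force

# Store the key as a secret in the key vault.
Set-AzKeyVaultSecret -VaultName $keyVaultName -Name $secretName -SecretValue $secretValue
```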

Using Service Principal for scanning

Creating a new service principal

If you need to create a new service principal, you must register an application in your Azure AD tenant and give the service principal access to your data sources. Your Azure AD Global Administrator, or a user in another role such as Application Administrator, can perform this operation.

Getting the Service Principal's Application ID

  1. Copy the Application (client) ID shown in the Overview of the service principal you created

    Screenshot that shows the Application (client) ID for the Service Principal

Granting the Service Principal access to your ADLS Gen2 account

It is important to give your service principal the permission to scan the ADLS Gen2 data source. You can add access for the service principal at the Subscription, Resource Group, or Resource level, depending on what level scan permissions are needed.

Note

You need to be an owner of the subscription to be able to add a service principal on an Azure resource.

  1. From the Azure portal, find either the subscription, resource group, or resource (for example, an Azure Data Lake Storage Gen2 storage account) that you would like to allow the catalog to scan.

    Screenshot that shows the storage account

  2. Select Access Control (IAM) in the left navigation and then select + Add --> Add role assignment

    Screenshot that shows the access control for the storage account

  3. Set the Role to Storage Blob Data Reader and enter your service principal under the Select input box. Then, select Save to give this role assignment to your service principal.

    Screenshot that shows the details to provide storage account permissions to the service principal
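As a scripted alternative to the portal steps above, the following minimal sketch assigns the role with the Az PowerShell module. It is an illustration under assumptions: the resource names are placeholders, and `$applicationId` stands for the Application (client) ID copied earlier.

```powershell
# Placeholder values (assumptions): replace with your own values.
$resourceGroupName  = "<ResourceGroupName>"
$storageAccountName = "<StorageAccountName>"
$applicationId      = "<ApplicationClientId>"

# Look up the storage account so its resource ID can be used as the scope.
$storageAccount = Get-AzStorageAccount -ResourceGroupName $resourceGroupName -Name $storageAccountName

# Assign Storage Blob Data Reader to the service principal,
# identified here by its Application (client) ID.
$roleAssignment = @{
    ApplicationId      = $applicationId
    RoleDefinitionName = "Storage Blob Data Reader"
    Scope              = $storageAccount.Id
}
New-AzRoleAssignment @roleAssignment
```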

Create the scan

  1. Open your Purview account and select Open Purview Studio

  2. Navigate to the Data map --> Sources to view the collection hierarchy

  3. Select the New Scan icon under the ADLS Gen2 data source registered earlier

    Screenshot that shows the screen to create a new scan

If using a system or user assigned managed identity

  1. Provide a Name for the scan, select the system-assigned or user-assigned managed identity under Credential, choose the appropriate collection for the scan, and select Test connection. On a successful connection, select Continue.

    Screenshot that shows the managed identity option to run the scan

If using Account Key

  1. Provide a Name for the scan, choose the appropriate collection for the scan, and select Authentication method as Account Key

    Screenshot that shows the Account Key option for scanning

If using Service Principal

  1. Provide a Name for the scan, choose the appropriate collection for the scan, and select + New under Credential

    Screenshot that shows the option for service principal to enable scanning

  2. Select the appropriate Key vault connection and the Secret name that was used while creating the Service Principal. The Service Principal ID is the Application (client) ID copied earlier.

    Screenshot that shows the service principal option

  3. Select Test connection. On a successful connection, select Continue

Scope and run the scan

  1. You can scope your scan to specific folders and subfolders by choosing the appropriate items in the list.

    Scope your scan

  2. Then select a scan rule set. You can choose between the system default, existing custom rule sets, or create a new rule set inline.

    Scan rule set

  3. If creating a new scan rule set, select the file types to be included in the scan rule.

    Scan rule set file types

  4. You can select the classification rules to be included in the scan rule

    Scan rule set classification rules

    Scan rule set selection

  5. Choose your scan trigger. You can set up a schedule or run the scan once.

    scan trigger

  6. Review your scan and select Save and run.

    review scan

View your scans and scan runs

To view existing scans, do the following:

  1. Go to the Purview Studio. Select the Data Map tab under the left pane.

  2. Select the desired data source. You will see a list of existing scans on that data source under Recent scans, or you can view all scans under the Scans tab.

  3. Select the scan that has results you want to view.

  4. This page will show you all of the previous scan runs along with the status and metrics for each scan run. It will also display whether your scan was scheduled or manual, how many assets had classifications applied, how many total assets were discovered, the start and end time of the scan, and the total scan duration.

Manage your scans - edit, delete, or cancel

To edit, cancel, or delete a scan, do the following:

  1. Go to the Purview Studio. Select the Data Map tab under the left pane.

  2. Select the desired data source. You will see a list of existing scans on that data source under Recent scans, or you can view all scans under the Scans tab.

  3. Select the scan you would like to manage. You can edit the scan by selecting Edit scan.

  4. You can cancel an in-progress scan by selecting Cancel scan run.

  5. You can delete your scan by selecting Delete scan.

Note

  • Deleting your scan does not delete catalog assets created from previous scans.
  • If you edit an asset's description in the schema tab of Purview, the asset will no longer be updated with schema changes when the source table changes and you re-scan it.

Access policy

Supported regions

Azure Purview (management side)

The Purview access policies capability is available in all Azure Purview regions.

Azure Storage (enforcement side)

Purview access policies can only be enforced in the following Azure Storage regions:

  • France Central
  • Canada Central

Enable access policy enforcement for the Azure Storage account

The following PowerShell commands need to be executed in the subscription where the Azure Storage account resides. This will cover all Azure Storage accounts in that subscription.

# Install the Az module for the current user
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force
# Sign in to the subscription where the storage account resides
Connect-AzAccount -Subscription <SubscriptionID>
# Register the access policy enforcement feature for Microsoft.Storage
Register-AzProviderFeature -FeatureName AllowPurviewPolicyEnforcement -ProviderNamespace Microsoft.Storage

If the output of the last command shows the value of “RegistrationState” as “Registered”, then your subscription is enabled for this functionality.
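Feature registration is not always immediate, so the state may need to be checked again after the register command has run. The following is a minimal sketch of such a check using `Get-AzProviderFeature` from the Az module; the messages are illustrative only.

```powershell
# Query the current registration state of the feature in this subscription.
$feature = Get-AzProviderFeature -FeatureName AllowPurviewPolicyEnforcement -ProviderNamespace Microsoft.Storage

if ($feature.RegistrationState -eq "Registered") {
    Write-Output "The subscription is enabled for Purview access policy enforcement."
} else {
    # Any other state (for example, still registering) means the feature is not ready yet.
    Write-Output "Current state: $($feature.RegistrationState). Check again later."
}
```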

Follow this configuration guide to enable access policies on an Azure Storage account

Next steps

Now that you have registered your source, follow the guides below to learn more about Purview and your data.