Connect to Azure Data Lake Storage in Microsoft Purview

This article outlines the process to register an Azure Data Lake Storage (ADLS Gen2) data source in Microsoft Purview including instructions to authenticate and interact with the ADLS Gen2 source.

Supported capabilities

Metadata Extraction Full Scan Incremental Scan Scoped Scan Classification Access Policy Lineage Data Sharing
Yes Yes Yes Yes Yes Yes (preview) Limited** Yes

** Lineage is supported if dataset is used as a source/sink in Data Factory Copy activity

Prerequisites

Register

This section will enable you to register the ADLS Gen2 data source for scan and data share in Purview.

Prerequisites for register

  • You'll need to be a Data Source Admin and one of the other Purview roles (for example, Data Reader or Data Share Contributor) to register a source and manage it in the Microsoft Purview governance portal. See our Microsoft Purview Permissions page for details.

Steps to register

It's important to register the data source in Microsoft Purview prior to setting up a scan for the data source.

  1. Go to the Azure portal, and navigate to the Microsoft Purview accounts page and select your Purview account

  2. Open Microsoft Purview governance portal and navigate to the Data Map --> Sources

    Screenshot that shows the link to open Microsoft Purview governance portal

    Screenshot that navigates to the Sources link in the Data Map

  3. Create the Collection hierarchy using the Collections menu and assign permissions to individual subcollections, as required

    Screenshot that shows the collection menu to create collection hierarchy

  4. Navigate to the appropriate collection under the Sources menu and select the Register icon to register a new ADLS Gen2 data source

    Screenshot that shows the collection used to register the data source

  5. Select the Azure Data Lake Storage Gen2 data source and select Continue

    Screenshot that allows selection of the data source

  6. Provide a suitable Name for the data source, select the relevant Azure subscription, existing Data Lake Store account name and the collection and select Apply. Leave the Data Use Management toggle on the disabled position until you have a chance to carefully go over this document.

    Screenshot that shows the details to be entered in order to register the data source

  7. The ADLS Gen2 storage account will be shown under the selected Collection

    Screenshot that shows the data source mapped to the collection to initiate scanning

Scan

Tip

To troubleshoot any issues with scanning:

  1. Confirm you have followed all prerequisites for scanning.
  2. Review our scan troubleshooting documentation.

Prerequisites for scan

In order to have access to scan the data source, an authentication method in the ADLS Gen2 Storage account needs to be configured. The following options are supported:

Note

If you have firewall enabled for the storage account, you must use managed identity authentication method when setting up a scan.

  • System-assigned managed identity (Recommended) - As soon as the Microsoft Purview Account is created, a system-assigned managed identity (SAMI) is created automatically in Azure AD tenant. Depending on the type of resource, specific RBAC role assignments are required for the Microsoft Purview system-assigned managed identity (SAMI) to perform the scans.

  • User-assigned managed identity (preview) - Similar to a system managed identity, a user-assigned managed identity (UAMI) is a credential resource that can be used to allow Microsoft Purview to authenticate against Azure Active Directory. For more information, you can see our User-assigned managed identity guide.

  • Account Key - Secrets can be created inside an Azure Key Vault to store credentials in order to enable access for Microsoft Purview to scan data sources securely using the secrets. A secret can be a storage account key, SQL login password, or a password.

    Note

    If you use this option, you need to deploy an Azure key vault resource in your subscription and assign Microsoft Purview account’s SAMI with required access permission to secrets inside Azure key vault.

  • Service Principal - In this method, you can create a new or use an existing service principal in your Azure Active Directory tenant.

Authentication for a scan

Using a system or user assigned managed identity for scanning

It's important to give your Microsoft Purview account or user-assigned managed identity (UAMI) the permission to scan the ADLS Gen2 data source. You can add your Microsoft Purview account's system-assigned managed identity (which has the same name as your Microsoft Purview account) or UAMI at the Subscription, Resource Group, or Resource level, depending on what level scan permissions are needed.

Note

You need to be an owner of the subscription to be able to add a managed identity on an Azure resource.

  1. From the Azure portal, find either the subscription, resource group, or resource (for example, an Azure Data Lake Storage Gen2 storage account) that you would like to allow the catalog to scan.

    Screenshot that shows the storage account

  2. Select Access Control (IAM) in the left navigation and then select + Add --> Add role assignment

    Screenshot that shows the access control for the storage account

  3. Set the Role to Storage Blob Data Reader and enter your Microsoft Purview account name or user-assigned managed identity under the Select input box. Then, select Save to give this role assignment to your Microsoft Purview account.

    Screenshot that shows the details to assign permissions for the Microsoft Purview account

    Note

    For more details, please see steps in Authorize access to blobs and queues using Azure Active Directory

    Note

    If you have firewall enabled for the storage account, you must use managed identity authentication method when setting up a scan.

  4. Go into your ADLS Gen2 storage account in Azure portal

  5. Navigate to Security + networking > Networking

    Screenshot that shows the details to provide firewall access

  6. Choose Selected Networks under Allow access from

    Screenshot that shows the details to allow access to selected networks

  7. In the Exceptions section, select Allow trusted Microsoft services to access this storage account and hit Save

    Screenshot that shows the exceptions to allow trusted Microsoft services to access the storage account

Create the scan

  1. Open your Microsoft Purview account and select the Open Microsoft Purview governance portal

  2. Navigate to the Data map --> Sources to view the collection hierarchy

  3. Select the New Scan icon under the ADLS Gen2 data source registered earlier

    Screenshot that shows the screen to create a new scan

If using a system or user assigned managed identity

  1. Provide a Name for the scan, select the system-assigned or user-assigned managed identity under Credential, choose the appropriate collection for the scan, and select Test connection. On a successful connection, select Continue.

    Screenshot that shows the managed identity option to run the scan

Scope and run the scan

  1. You can scope your scan to specific folders and subfolders by choosing the appropriate items in the list.

    Scope your scan

  2. Then select a scan rule set. You can choose between the system default, existing custom rule sets, or create a new rule set inline.

    Scan rule set

  3. If creating a new scan rule set, select the file types to be included in the scan rule.

    Scan rule set file types

  4. You can select the classification rules to be included in the scan rule

    Scan rule set classification rules

    Scan rule set selection

  5. Choose your scan trigger. You can set up a schedule or run the scan once.

    scan trigger

  6. Review your scan and select Save and run.

    review scan

View your scans and scan runs

To view existing scans, do the following:

  1. Go to the Microsoft Purview governance portal. Select the Data Map tab under the left pane.

  2. Select the desired data source. You will see a list of existing scans on that data source under Recent scans, or can view all scans under the Scans tab.

  3. Select the scan that has results you want to view.

  4. This page will show you all of the previous scan runs along with the status and metrics for each scan run. It will also display whether your scan was scheduled or manual, how many assets had classifications applied, how many total assets were discovered, the start and end time of the scan, and the total scan duration.

Manage your scans - edit, delete, or cancel

To manage or delete a scan, do the following:

  1. Go to the Microsoft Purview governance portal. Select the Data Map tab under the left pane.

  2. Select the desired data source. You will see a list of existing scans on that data source under Recent scans, or can view all scans under the Scans tab.

  3. Select the scan you would like to manage. You can edit the scan by selecting Edit scan.

  4. You can cancel an in progress scan by selecting Cancel scan run.

  5. You can delete your scan by selecting Delete scan.

Note

  • Deleting your scan does not delete catalog assets created from previous scans.
  • The asset will no longer be updated with schema changes if your source table has changed and you re-scan the source table after editing the description in the schema tab of Microsoft Purview.

Data sharing

Microsoft Purview Data Sharing (preview) enables sharing of data in-place from ADLS Gen2 to ADLS Gen2. This section provides details about the ADLS Gen2 specific requirements for sharing and receiving data in-place. Refer to How to share data and How to receive share for step by step guide on how to use data share.

Storage accounts supported for in-place data sharing

The following storage accounts are supported for in-place data sharing:

  • Regions: Canada Central, Canada East, UK South, UK West, Australia East, Japan East, Korea South, and South Africa North
  • Redundancy options: LRS, GRS, RA-GRS
  • Tiers: Hot, Cool

Only use storage account without production workload for the preview.

Note

Source and target storage accounts must be in the same region as each other. They don't need to be in the same region as the Microsoft Purview account.

Storage account permissions required to share data

To add or update a storage account asset to a share, you need ONE of the following permissions:

  • Microsoft.Authorization/roleAssignments/write - This permission is available in the Owner role.
  • Microsoft.Storage/storageAccounts/blobServices/containers/blobs/modifyPermissions/ - This permission is available in the Blob Storage Data Owner role.

Storage account permissions required to receive shared data

To map a storage account asset in a received share, you need ONE of the following permissions:

  • Microsoft.Storage/storageAccounts/write - This permission is available in the Contributor and Owner role.
  • Microsoft.Storage/storageAccounts/blobServices/containers/write - This permission is available in the Contributor, Owner, Storage Blob Data Contributor and Storage Blob Data Owner role.

Update shared data in source storage account

Updates you make to shared files or data in the shared folder from source storage account will be made available to recipient in target storage account in near real time. When you delete subfolder or files within the shared folder, they'll disappear for recipient. To delete the shared folder, file or parent folders or containers, you need to first revoke access to all your shares from the source storage account.

Access shared data in target storage account

The target storage account enables recipient to access the shared data read-only in near real time. You can connect analytics tools such as Synapse Workspace and Databricks to the shared data to perform analytics. Cost of accessing the shared data is charged to the target storage account.

Service limit

Source storage account can support up to 20 targets, and target storage account can support up to 100 sources. If you require an increase in limit, contact Support.

Access policy

To create an access policy for Azure Data Lake Storage Gen 2, follow these guides:

Next steps

Now that you've registered your source, follow the below guides to learn more about Microsoft Purview and your data.