Register and scan Azure Blob Storage

This article outlines how to register an Azure Blob Storage account in Purview and set up a scan.

Supported capabilities

Purview supports full and incremental scans of Azure Blob Storage to capture metadata and schema. Scans also classify the data automatically based on system and custom classification rules.

For delimited file types such as csv, tsv, psv, and ssv, the schema is extracted when all of the following conditions hold:

  1. First row values are non-empty
  2. First row values are unique
  3. First row values are neither a date nor a number
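
The three conditions above can be sketched as a small check on the first row of a delimited file. This is only an illustration of the rules as stated, not Purview's implementation: the function names are hypothetical, and the date layouts and number pattern are assumptions, since the formats the scanner actually recognizes are not listed here.

```python
import csv
import io
import re
from datetime import datetime
from typing import List, Optional

# Assumed date layouts; the formats the scanner really recognizes are not documented here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")
# Assumed notion of "a number": optional sign, digits, optional fraction and exponent.
NUMBER_PATTERN = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def is_number(value: str) -> bool:
    return NUMBER_PATTERN.fullmatch(value) is not None


def is_date(value: str) -> bool:
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def first_row_is_header(first_row: List[str]) -> bool:
    cells = [cell.strip() for cell in first_row]
    if not cells or any(cell == "" for cell in cells):
        return False  # condition 1: every value is non-empty
    if len(set(cells)) != len(cells):
        return False  # condition 2: values are unique
    # condition 3: no value is a date or a number
    return not any(is_number(cell) or is_date(cell) for cell in cells)


def header_from_text(text: str, delimiter: str = ",") -> Optional[List[str]]:
    """Return the column names if the first row qualifies as a header, else None."""
    first_row = next(csv.reader(io.StringIO(text), delimiter=delimiter), None)
    if first_row is not None and first_row_is_header(first_row):
        return [cell.strip() for cell in first_row]
    return None
```

For example, `header_from_text("id,name,created\n1,Ann,2020-01-01\n")` returns the three column names, while a file whose first row is `1,2,3` yields `None` because its values are numbers.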

Prerequisites

  • Before registering data sources, create an Azure Purview account. For more information on creating a Purview account, see Quickstart: Create an Azure Purview account.
  • You need to be an Azure Purview Data Source Admin.

Setting up authentication for a scan

There are three ways to set up authentication for Azure Blob Storage:

  • Managed Identity
  • Account Key
  • Service Principal

If you choose Managed Identity, first give your Purview account permission to scan the data source:

  1. Navigate to your storage account.
  2. Select Access Control (IAM) from the left navigation menu.
  3. Select + Add.
  4. Set the Role to Storage Blob Data Reader and enter your Azure Purview account name in the Select input box. Then select Save to assign the role to your Purview account.

Note

For more details, see the steps in Authorize access to blobs and queues using Azure Active Directory.
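
The same role assignment can also be made programmatically. The sketch below only builds the Azure Resource Manager request for assigning Storage Blob Data Reader at storage-account scope; it does not send it. The function name, the placeholder IDs, and the chosen API version are assumptions for illustration, and the principal ID is the object ID of the Purview managed identity (or of a service principal).

```python
import uuid

# Built-in role definition ID for "Storage Blob Data Reader".
STORAGE_BLOB_DATA_READER = "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"
ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2022-04-01"  # assumed; check the version supported in your environment


def build_role_assignment_request(subscription_id, resource_group,
                                  storage_account, principal_id,
                                  assignment_name=None):
    """Return (method, url, body) for a role assignment on one storage account."""
    scope = (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )
    # Role assignment names are GUIDs; generate one if the caller gives none.
    assignment_name = assignment_name or str(uuid.uuid4())
    url = (
        f"{ARM_ENDPOINT}{scope}/providers/Microsoft.Authorization"
        f"/roleAssignments/{assignment_name}?api-version={API_VERSION}"
    )
    body = {
        "properties": {
            "roleDefinitionId": (
                f"/subscriptions/{subscription_id}/providers"
                f"/Microsoft.Authorization/roleDefinitions/{STORAGE_BLOB_DATA_READER}"
            ),
            "principalId": principal_id,
        }
    }
    return "PUT", url, body
```

Sending the request requires a bearer token for Azure Resource Manager and permission to create role assignments on the storage account, neither of which is shown here.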

Account Key

When the selected authentication method is Account Key, get your access key and store it in the key vault:

  1. Navigate to your storage account
  2. Select Settings > Access keys
  3. Copy your key and save it somewhere for the next steps
  4. Navigate to your key vault
  5. Select Settings > Secrets
  6. Select + Generate/Import, enter a Name of your choice, and set the Value to the key from your storage account
  7. Select Create to complete
  8. If your key vault is not connected to Purview yet, you will need to create a new key vault connection
  9. Finally, create a new credential using the key to set up your scan
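
Storing the key as a secret (steps 4 to 7 above) can also be scripted against the Key Vault REST interface. The following is a minimal sketch that only prepares the request; the function name and the API version are assumptions, and the name check reflects the Key Vault rule that secret names contain only letters, digits, and hyphens.

```python
import re

KEY_VAULT_API_VERSION = "7.4"  # assumed; use the version your vault supports
# Key Vault secret names: 1-127 characters, letters, digits, and hyphens only.
SECRET_NAME_PATTERN = re.compile(r"[0-9A-Za-z-]{1,127}")


def build_set_secret_request(vault_name, secret_name, secret_value):
    """Return (method, url, body) for storing one secret in a key vault."""
    if not SECRET_NAME_PATTERN.fullmatch(secret_name):
        raise ValueError(f"invalid secret name: {secret_name!r}")
    if not secret_value:
        raise ValueError("secret value must not be empty")
    url = (
        f"https://{vault_name}.vault.azure.net/secrets/{secret_name}"
        f"?api-version={KEY_VAULT_API_VERSION}"
    )
    return "PUT", url, {"value": secret_value}
```

The same helper applies to a service principal's client secret. Avoid logging the returned body, since it contains the secret value in plain text.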

Service principal

To use a service principal, you can use an existing one or create a new one.

Note

If you have to create a new Service Principal, please follow these steps:

  1. Navigate to the Azure portal.
  2. Select Azure Active Directory from the left-hand side menu.
  3. Select App registrations.
  4. Select + New application registration.
  5. Enter a name for the application (the service principal name).
  6. Select Accounts in this organizational directory only.
  7. For Redirect URI select Web and enter any URL you want; it doesn't have to be real or work.
  8. Then select Register.

You need the Service Principal's application ID and secret:

  1. Navigate to your Service Principal in the Azure portal
  2. Copy the Application (client) ID from Overview and the Client secret from Certificates & secrets.
  3. Navigate to your key vault
  4. Select Settings > Secrets
  5. Select + Generate/Import, enter a Name of your choice, and set the Value to the Client secret from your Service Principal
  6. Select Create to complete
  7. If your key vault is not connected to Purview yet, you will need to create a new key vault connection
  8. Finally, create a new credential using the Service Principal to set up your scan
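
To see how the application ID and client secret are used together, the sketch below prepares an OAuth 2.0 client-credentials token request for Azure Storage. It is an illustration only and is not part of the Purview setup itself: the function name is hypothetical, the tenant ID is an extra value you would need to look up, and the request is built but not sent.

```python
from urllib.parse import urlencode

# Scope requesting the application permissions granted on Azure Storage.
STORAGE_SCOPE = "https://storage.azure.com/.default"


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form_body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,          # Application (client) ID from Overview
        "client_secret": client_secret,  # Client secret from Certificates & secrets
        "scope": STORAGE_SCOPE,
    })
    return url, form_body
```

Posting `form_body` to `url` with the content type `application/x-www-form-urlencoded` should return an access token if the ID and secret are valid. Reading blobs with that token additionally requires a role assignment on the storage account.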

Granting the Service Principal access to your blob storage

  1. Navigate to your storage account.
  2. Select Access Control (IAM) from the left navigation menu.
  3. Select + Add.
  4. Set the Role to Storage Blob Data Reader and enter your service principal name or object ID in the Select input box. Then select Save to assign the role to your service principal.

Firewall settings

Note

If a firewall is enabled for the storage account, you must use the Managed Identity authentication method when setting up a scan.

  1. Go to your storage account in the Azure portal
  2. Navigate to Settings > Networking
  3. Choose Selected Networks under Allow access from
  4. In the Firewall section, select Allow trusted Microsoft services to access this storage account, then select Save
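
In Azure Resource Manager terms, these portal settings correspond to the storage account's `networkAcls` property: Selected Networks maps to a default action of `Deny`, and the trusted-services checkbox maps to `AzureServices` in the `bypass` list. The sketch below builds such an update body while keeping any bypass entries already configured. The function names are hypothetical and the body is not sent anywhere.

```python
def merge_bypass(existing: str) -> str:
    """Add AzureServices to a comma-separated bypass string, keeping other entries."""
    parts = [p.strip() for p in existing.split(",")]
    parts = [p for p in parts if p and p != "None"]
    if "AzureServices" not in parts:
        parts.append("AzureServices")
    return ", ".join(parts)


def build_firewall_patch(existing_bypass: str = "None") -> dict:
    """Return an update body that restricts access but lets trusted services through."""
    return {
        "properties": {
            "networkAcls": {
                "defaultAction": "Deny",  # "Selected Networks" in the portal
                "bypass": merge_bypass(existing_bypass),
            }
        }
    }
```

For a storage account that already bypasses the firewall for `Logging, Metrics`, the merged value becomes `Logging, Metrics, AzureServices`, so existing exceptions are not lost.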

Register an Azure Blob Storage account

To register a new blob account in your data catalog, do the following:

  1. Navigate to the Purview Studio from your Purview account in the portal.
  2. Select Register Sources on the home page of the Purview Studio.
  3. Select Register.
  4. On Register sources, select Azure Blob Storage.
  5. Select Continue.

On the Register sources (Azure Blob Storage) screen, do the following:

  1. Enter a Name that the data source will be listed with in the Catalog.
  2. Choose your subscription to filter down storage accounts.
  3. Select a storage account.
  4. Select a collection or create a new one (Optional).
  5. Select Register to register the data source.
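
Registration can also be scripted. The sketch below mirrors the form fields above as a request to the Purview scanning REST interface. Treat every detail as an assumption to verify against the current API reference: the URL path, the API version, the `AzureStorage` kind, and the property names follow one reading of that interface, and the function name is hypothetical. The request is built but not sent.

```python
PURVIEW_API_VERSION = "2022-02-01-preview"  # assumed; verify before use


def build_register_request(purview_endpoint, source_name, storage_account,
                           collection=None):
    """Return (method, url, body) for registering a blob storage account."""
    url = (
        f"{purview_endpoint.rstrip('/')}/scan/datasources/{source_name}"
        f"?api-version={PURVIEW_API_VERSION}"
    )
    properties = {
        # Blob endpoint of the storage account chosen in step 3.
        "endpoint": f"https://{storage_account}.blob.core.windows.net/",
    }
    if collection:  # step 4 is optional
        properties["collection"] = {
            "referenceName": collection,
            "type": "CollectionReference",
        }
    return "PUT", url, {"kind": "AzureStorage", "properties": properties}
```

Here `purview_endpoint` stands for the base URL of your Purview account, passed in rather than derived, because the host name format has changed between service versions.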

Creating and running a scan

To create and run a new scan, do the following:

  1. Select the Data Map tab on the left pane in the Purview Studio.

  2. Select the Azure Blob data source that you registered.

  3. Select New scan.

  4. Select the credential to connect to your data source.

  5. You can scope your scan to specific folders or subfolders by choosing the appropriate items in the list.

  6. Then select a scan rule set. You can choose between the system default, existing custom rule sets, or create a new rule set inline.

  7. Choose your scan trigger. You can set up a schedule or run the scan once.

  8. Review your scan and select Save and run.
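
Scoping a scan to folders (step 5) amounts to a prefix match on blob paths. The sketch below illustrates the idea under the assumption that selecting a folder includes everything beneath it. It is not Purview's implementation, and the function names are hypothetical.

```python
from typing import Iterable, List


def normalize(path: str) -> str:
    """Drop leading and trailing slashes so paths compare consistently."""
    return path.strip("/")


def in_scope(blob_path: str, selected_folders: Iterable[str]) -> bool:
    """True if the blob lies in, or below, any selected folder."""
    blob = normalize(blob_path)
    for folder in selected_folders:
        prefix = normalize(folder)
        # An empty prefix stands for the container root, which covers everything.
        if prefix == "" or blob == prefix or blob.startswith(prefix + "/"):
            return True
    return False


def blobs_to_scan(blob_paths: Iterable[str], selected_folders: Iterable[str]) -> List[str]:
    """Keep only the blobs that fall inside the selected scope."""
    folders = list(selected_folders)
    return [path for path in blob_paths if in_scope(path, folders)]
```

Matching on `prefix + "/"` rather than on the bare prefix keeps a selection of `raw/2021` from also picking up a sibling folder such as `raw/2021-archive`.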

Viewing your scans and scan runs

To view existing scans, do the following:

  1. Go to the Purview Studio. Select the Data Map tab on the left pane.

  2. Select the desired data source. You will see a list of existing scans on that data source under Recent scans, or you can view all scans under the Scans tab.

  3. Select the scan that has results you want to view.

  4. This page shows all of the previous scan runs along with the status and metrics for each run. It also displays whether each scan was scheduled or manual, how many assets had classifications applied, how many total assets were discovered, the start and end time of the scan, and the total scan duration.

Manage your scans - edit, delete, or cancel

To edit, cancel, or delete a scan, do the following:

  1. Go to the Purview Studio. Select the Data Map tab on the left pane.

  2. Select the desired data source. You will see a list of existing scans on that data source under Recent scans, or you can view all scans under the Scans tab.

  3. Select the scan you would like to manage. You can edit the scan by selecting Edit scan.

  4. You can cancel an in-progress scan by selecting Cancel scan run.

  5. You can delete your scan by selecting Delete scan.

Next steps