Connect to storage by using identity-based data access

In this article, you learn how to connect to storage services on Azure by using identity-based data access and Azure Machine Learning datastores via the Azure Machine Learning SDK for Python.

Typically, datastores use credential-based data access to confirm you have permission to access the storage service. They keep connection information, like your subscription ID and token authorization, in the key vault that's associated with the workspace. When you create a datastore that uses identity-based data access, your Azure account (Azure Active Directory token) is used to confirm you have permission to access the storage service. In this scenario, no authentication credentials are saved. Only the storage account information is stored in the datastore.

To create datastores that use credential-based authentication, like access keys or service principals, see Connect to storage services on Azure.

Identity-based data access in Azure Machine Learning

You can apply identity-based data access in Azure Machine Learning in two scenarios. Both are a good fit when you're working with confidential data and need more granular data access management:

  • Accessing storage services
  • Training machine learning models with private data

Accessing storage services

You can connect to storage services via identity-based data access with Azure Machine Learning datastores or Azure Machine Learning datasets.

Your authentication credentials are usually kept in a datastore, which is used to ensure you have permission to access the storage service. When these credentials are registered via datastores, any user with the workspace Reader role can retrieve them. That scale of access can be a security concern for some organizations. Learn more about the workspace Reader role.

When you use identity-based data access, Azure Machine Learning prompts you for your Azure Active Directory token for data access authentication instead of keeping your credentials in the datastore. That approach allows for data access management at the storage level and keeps credentials confidential.

The same behavior applies when you create a dataset directly from a storage URL.

Note

Credentials stored via credential-based authentication include subscription IDs, shared access signature (SAS) tokens, storage access keys, and service principal information, like client IDs and tenant IDs.

Model training on private data

Certain machine learning scenarios involve training models with private data. In such cases, data scientists need to run training workflows without being exposed to the confidential input data. In this scenario, a managed identity of the training compute is used for data access authentication. This approach allows storage admins to grant Storage Blob Data Reader access to the managed identity that the training compute uses to run the training job. The individual data scientists don't need to be granted access. For more information, see Set up managed identity on a compute cluster.

Prerequisites

Storage access permissions

To help ensure that you securely connect to your storage service on Azure, Azure Machine Learning requires that you have permission to access the corresponding data storage.

Identity-based data access supports connections to only the following storage services:

  • Azure Blob Storage
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2
  • Azure SQL Database

To access these storage services, you must have at least Storage Blob Data Reader access. Only storage account owners can change your access level via the Azure portal.

If you're training a model on a remote compute target, the compute identity must be granted at least the Storage Blob Data Reader role from the storage service. Learn how to set up managed identity on a compute cluster.

Work with virtual networks

By default, Azure Machine Learning can't communicate with a storage account that's behind a firewall or in a virtual network.

You can configure storage accounts to allow access only from within specific virtual networks. This configuration requires additional steps to ensure data isn't leaked outside of the network. This behavior is the same for credential-based data access. For more information, see How to configure virtual network scenarios.

Create and register datastores

When you register a storage service on Azure as a datastore, you automatically create and register that datastore to a specific workspace. See Storage access permissions for guidance on required permission types. See Work with virtual networks for details on how to connect to data storage behind virtual networks.

In the following code, notice the absence of authentication parameters like sas_token, account_key, subscription_id, and the service principal client_id. This omission indicates that Azure Machine Learning will use identity-based data access for authentication. Because datastores are typically created interactively in a notebook or via the studio, your Azure Active Directory token is used for data access authentication.
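The rule described above can be sketched as a small check. This is only an illustration of the reasoning, not the SDK's internal logic: the helper name uses_identity_based_access and the CREDENTIAL_PARAMS set are hypothetical, and the set lists only the four parameters named in this article.

```python
# Hypothetical helper: mirrors the rule described in this article,
# not the SDK's internal behavior.
# Only the four credential parameters named in the article are listed.
CREDENTIAL_PARAMS = {"sas_token", "account_key", "subscription_id", "client_id"}


def uses_identity_based_access(register_kwargs):
    """Return True when no credential parameter is supplied (None counts as absent)."""
    return not any(
        register_kwargs.get(name) is not None for name in CREDENTIAL_PARAMS
    )


# Keyword arguments with no credentials, as in the registration examples.
identity_kwargs = {
    "datastore_name": "credentialless_blob",
    "container_name": "my_container_name",
    "account_name": "my_account_name",
}
# The same arguments plus a placeholder access key.
credential_kwargs = dict(identity_kwargs, account_key="<storage-account-key>")

print(uses_identity_based_access(identity_kwargs))    # True
print(uses_identity_based_access(credential_kwargs))  # False
```

A check like this could run before registration to confirm that no credential slipped into a shared notebook.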

Note

Datastore names should consist only of lowercase letters, numbers, and underscores.
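As a minimal sketch, the naming rule in this note can be checked before you call a registration method. The function name is_valid_datastore_name is hypothetical, and the pattern encodes only the rule stated here; the service might enforce additional constraints.

```python
import re

# Pattern taken from the note above: lowercase letters, numbers, underscores.
_DATASTORE_NAME = re.compile(r"[a-z0-9_]+")


def is_valid_datastore_name(name):
    """Return True when the whole name matches the allowed character set."""
    return _DATASTORE_NAME.fullmatch(name) is not None


print(is_valid_datastore_name("credentialless_blob"))  # True
print(is_valid_datastore_name("Credentialless-Blob"))  # False
```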

Azure blob container

To register an Azure blob container as a datastore, use register_azure_blob_container().

The following code creates the credentialless_blob datastore, registers it to the ws workspace, and assigns it to the blob_datastore variable. This datastore accesses the my_container_name blob container on the my_account_name storage account.

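from azureml.core import Datastore, Workspace
ws = Workspace.from_config()  # Assumes a local config.json for your workspace.

# Create blob datastore without credentials.
blob_datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                         datastore_name='credentialless_blob',
                                                         container_name='my_container_name',
                                                         account_name='my_account_name')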

Azure Data Lake Storage Gen1

Use register_azure_data_lake() to register a datastore that connects to Azure Data Lake Storage Gen1.

The following code creates the credentialless_adls1 datastore, registers it to the ws workspace, and assigns it to the adls_dstore variable. This datastore accesses the adls_storage Azure Data Lake Storage account.

# Create Azure Data Lake Storage Gen1 datastore without credentials.
adls_dstore = Datastore.register_azure_data_lake(workspace=ws,
                                                 datastore_name='credentialless_adls1',
                                                 store_name='adls_storage')

Azure Data Lake Storage Gen2

Use register_azure_data_lake_gen2() to register a datastore that connects to Azure Data Lake Storage Gen2.

The following code creates the credentialless_adls2 datastore, registers it to the ws workspace, and assigns it to the adls2_dstore variable. This datastore accesses the tabular file system in the myadls2 storage account.

# Create Azure Data Lake Storage Gen2 datastore without credentials.
adls2_dstore = Datastore.register_azure_data_lake_gen2(workspace=ws, 
                                                       datastore_name='credentialless_adls2', 
                                                       filesystem='tabular', 
                                                       account_name='myadls2')

Azure SQL database

Use register_azure_sql_database() to register a datastore that connects to an Azure SQL database.

The following code creates the credentialless_sqldb datastore, registers it to the ws workspace, and assigns it to the sqldb_dstore variable. This datastore accesses the mydb database on the myserver SQL server.

# Create Azure SQL database datastore without credentials.
sqldb_dstore = Datastore.register_azure_sql_database(workspace=ws,
                                                       datastore_name='credentialless_sqldb',
                                                       server_name='myserver',
                                                       database_name='mydb')                                                       
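The four registration calls above differ only in the method name and in the parameters that locate the storage. As a rough summary, the following sketch collects them in a lookup table. The service keys ("blob", "adls_gen1", and so on) and the identity_based_call helper are hypothetical names for illustration; the method and parameter names are the ones used in the preceding examples.

```python
# Method name and required location parameters per storage service,
# as used in the registration examples in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
REGISTRATION_METHODS = {
    "blob": ("register_azure_blob_container", ("container_name", "account_name")),
    "adls_gen1": ("register_azure_data_lake", ("store_name",)),
    "adls_gen2": ("register_azure_data_lake_gen2", ("filesystem", "account_name")),
    "sql": ("register_azure_sql_database", ("server_name", "database_name")),
}


def identity_based_call(service, datastore_name, **location):
    """Return (method_name, kwargs) for a registration with no credentials."""
    method_name, required = REGISTRATION_METHODS[service]
    missing = [p for p in required if p not in location]
    if missing:
        raise ValueError(f"{service} is missing: {', '.join(missing)}")
    kwargs = {"datastore_name": datastore_name}
    kwargs.update({p: location[p] for p in required})
    return method_name, kwargs


method_name, kwargs = identity_based_call(
    "adls_gen2", "credentialless_adls2", filesystem="tabular", account_name="myadls2"
)
print(method_name, kwargs)
```

With the SDK installed, getattr(Datastore, method_name)(workspace=ws, **kwargs) would then make the actual registration call.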
                                                   

Use data in storage

We recommend that you use Azure Machine Learning datasets when you interact with your data in storage.

Important

Datasets that use identity-based data access are not supported for automated ML experiments.

Datasets package your data into a lazily evaluated consumable object for machine learning tasks like training. Also, with datasets you can download or mount files of any format from Azure storage services like Azure Blob Storage and Azure Data Lake Storage to a compute target.

To create datasets with identity-based data access, you have the following options. This type of dataset creation uses your Azure Active Directory token for data access authentication.

  • Reference paths from datastores that also use identity-based data access.
    In the following example, blob_datastore already exists and uses identity-based data access.

    blob_dataset = Dataset.Tabular.from_delimited_files(path=(blob_datastore, 'test.csv'))
    
  • Skip datastore creation and create datasets directly from storage URLs. This functionality currently supports only Azure blobs and Azure Data Lake Storage Gen1 and Gen2.

    blob_dset = Dataset.File.from_files('https://myblob.blob.core.windows.net/may/keras-mnist-fashion/')
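Because direct dataset creation supports only Azure blobs and Azure Data Lake Storage Gen1 and Gen2, it can help to check which service a URL points to before you pass it to a dataset factory method. The following is a minimal sketch under stated assumptions: the helper storage_service_for_url is hypothetical, and the host suffixes are the ones used by the public Azure cloud, so sovereign clouds would need different values.

```python
from urllib.parse import urlparse

# Host suffixes for the public Azure cloud (assumption: other clouds differ).
_SUPPORTED_HOST_SUFFIXES = {
    ".blob.core.windows.net": "Azure Blob Storage",
    ".azuredatalakestore.net": "Azure Data Lake Storage Gen1",
    ".dfs.core.windows.net": "Azure Data Lake Storage Gen2",
}


def storage_service_for_url(url):
    """Return the service name for a supported storage URL, or None otherwise."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, service in _SUPPORTED_HOST_SUFFIXES.items():
        if host.endswith(suffix):
            return service
    return None


url = "https://myblob.blob.core.windows.net/may/keras-mnist-fashion/"
print(storage_service_for_url(url))  # Azure Blob Storage
```

A None result means the URL doesn't match a supported service, and a datastore is the safer route.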
    

When you submit a training job that consumes a dataset created with identity-based data access, the managed identity of the training compute is used for data access authentication. Your Azure Active Directory token isn't used. For this scenario, ensure that the managed identity of the compute is granted at least the Storage Blob Data Reader role from the storage service. For more information, see Set up managed identity on compute clusters.

Next steps