Data administration

Learn how to manage data access and how to authenticate in Azure Machine Learning.

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Important

This article is intended for Azure administrators who want to create the required infrastructure for an Azure Machine Learning solution.

Credential-based data authentication

In general, credential-based data authentication involves these checks:

  • Has the user who is accessing data from the credential-based datastore been assigned an RBAC role that contains Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action?

    • This permission is required to retrieve credentials from the datastore on behalf of the user.
    • Built-in roles that already contain this permission include the Contributor, Azure AI Developer, and AzureML Data Scientist roles. Alternatively, if a custom role is applied, make sure that this permission is added to that custom role.
    • You must know which specific identity is trying to access the data. It can be a real user with a user identity, or a compute resource with a compute managed identity (MSI). See the Scenarios and authentication options section to identify the identity that needs the permission.
  • Does the stored credential (service principal, account key, or SAS token) have access to the data resource?
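
The first check above can be reasoned about as a pattern match over a role definition. The following is a minimal sketch, not an Azure API call. It assumes that a role is described by a list of allowed actions and a list of excluded actions, that a * wildcard in an entry matches any run of characters, and that matching is case-insensitive. The helper name role_grants and the sample role definitions are hypothetical.

```python
from fnmatch import fnmatchcase

LIST_SECRETS = "Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action"


def role_grants(permission, actions, not_actions=()):
    """Return True if the role's actions allow `permission`.

    Assumption: entries may contain `*` wildcards and are compared
    case-insensitively; a match in `not_actions` removes the permission.
    """
    needle = permission.lower()
    allowed = any(fnmatchcase(needle, pattern.lower()) for pattern in actions)
    excluded = any(fnmatchcase(needle, pattern.lower()) for pattern in not_actions)
    return allowed and not excluded


# Hypothetical role definitions, for illustration only.
broad_role = {"actions": ["*"], "not_actions": ["Microsoft.Authorization/*/Delete"]}
custom_role = {"actions": ["Microsoft.MachineLearningServices/workspaces/datastores/read"]}
fixed_custom_role = {"actions": custom_role["actions"] + [LIST_SECRETS]}

print(role_grants(LIST_SECRETS, broad_role["actions"], broad_role["not_actions"]))  # True
print(role_grants(LIST_SECRETS, custom_role["actions"]))                            # False
print(role_grants(LIST_SECRETS, fixed_custom_role["actions"]))                      # True
```

The second and third calls illustrate the custom-role case: the role grants access only after the listsecrets action is added to it.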

Identity-based data authentication

In general, identity-based data authentication involves these checks:

  • Which user wants to access the resources?
    • Depending on the context in which the data is accessed, different types of authentication are available, for example:
      • user identity
      • compute managed identity
      • workspace managed identity
    • Jobs, including the dataset "Generate Profile" option, run on a compute resource in your subscription, and access the data from that location. The compute managed identity needs permission to the storage resource, instead of the identity of the user that submitted the job.
    • For authentication based on a user identity, you must know which specific user tried to access the storage resource. For more information about user authentication, see authentication for Azure Machine Learning. For more information about service-level authentication, see authentication between Azure Machine Learning and other services.
  • Does this user have permission for reading?
    • Does the user identity or the compute managed identity, etc., have the necessary permissions for that storage resource? Permissions are granted using Azure role-based access controls (Azure RBAC).
    • The storage account Reader reads the storage metadata.
    • The Storage Blob Data Reader reads and lists Blob storage containers and blobs.
    • For more roles, see the Azure built-in roles for storage.
  • Does this user have permission for writing?
    • Does the user identity or the compute managed identity, etc., have the necessary permissions for that storage resource? Permissions are granted using Azure role-based access controls (Azure RBAC).
    • The storage account Reader reads the storage metadata.
    • The Storage Blob Data Contributor reads, writes, and deletes Azure Storage containers and blobs.
    • For more roles, see the Azure built-in roles for storage.
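
The read and write checks above amount to comparing what an identity needs against what its assigned roles provide. The sketch below is a simplified illustration under stated assumptions: the capability names (metadata, read, list, write, delete) are shorthand taken from the role descriptions in this section, not real Azure action strings, and the helper missing_capabilities is hypothetical.

```python
# Roles named in the checks above, mapped to what they allow.
# Assumption: this is a simplification for illustration; real role
# definitions contain many more granular actions.
ROLE_CAPABILITIES = {
    "Reader": {"metadata"},
    "Storage Blob Data Reader": {"read", "list"},
    "Storage Blob Data Contributor": {"read", "list", "write", "delete"},
}


def missing_capabilities(assigned_roles, needed):
    """Return the capabilities in `needed` that no assigned role provides."""
    granted = set()
    for role in assigned_roles:
        # Unknown role names contribute nothing in this sketch.
        granted |= ROLE_CAPABILITIES.get(role, set())
    return set(needed) - granted


# A compute managed identity that can read but not write.
roles = ["Reader", "Storage Blob Data Reader"]
print(missing_capabilities(roles, {"metadata", "read"}))   # set()
print(missing_capabilities(roles, {"metadata", "write"}))  # {'write'}
```

An empty result means that the identity, whether a user identity or a compute managed identity, has what the operation needs. A non-empty result names the gap to close with an additional role assignment.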

Other general checks for authentication

  • Where does the access come from?
    • User: Is the client IP address in the VNet/subnet range?
    • Workspace: Is the workspace public, or does it have a private endpoint in a VNet/subnet?
    • Storage: Does the storage allow public access, or does it restrict access through a service endpoint or a private endpoint?
  • What operation will be performed?
    • Azure Machine Learning handles create, read, update, and delete (CRUD) operations on a datastore or dataset.
    • Archive operations on data assets in the studio require this RBAC operation: Microsoft.MachineLearningServices/workspaces/datasets/registered/delete
    • Data Access calls (for example, preview or schema) go to the underlying storage, and need extra permissions.
  • Will this operation run in your Azure subscription compute resources, or resources hosted in a Microsoft subscription?
    • All calls to dataset and datastore services (except the "Generate Profile" option) use resources hosted in a Microsoft subscription to run the operations.
    • Jobs, including the dataset "Generate Profile" option, run on a compute resource in your subscription, and access the data from that location. The compute identity needs permission to the storage resource, instead of the identity of the user that submitted the job.
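
The first check in this list, whether the client IP address falls in the VNet/subnet range, can be verified with Python's standard ipaddress module. This is a minimal sketch. The helper name ip_in_subnet and the addresses are illustrative placeholders, not values from a real deployment.

```python
import ipaddress


def ip_in_subnet(client_ip, subnet_cidr):
    """Return True if `client_ip` falls inside the CIDR range `subnet_cidr`."""
    # strict=False tolerates a CIDR whose host bits are set, such as 10.0.1.5/24.
    network = ipaddress.ip_network(subnet_cidr, strict=False)
    return ipaddress.ip_address(client_ip) in network


# Placeholder subnet range and client addresses.
print(ip_in_subnet("10.0.1.25", "10.0.1.0/24"))    # True
print(ip_in_subnet("203.0.113.7", "10.0.1.0/24"))  # False
```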

This diagram shows the general flow of a data access call. Here, a user tries to make a data access call through a machine learning workspace, without using a compute resource.

Diagram of the logic flow when accessing data.

Scenarios and authentication options

This table lists the identities to use for specific scenarios:

Configuration | SDK Local/Notebook VM | Job | Dataset Preview | Datastore Browse
--- | --- | --- | --- | ---
Credential + Workspace MSI | Credential | Credential | Workspace MSI | Credential (only account key and SAS token)
No Credential + Workspace MSI | Compute MSI/User identity | Compute MSI/User identity | Workspace MSI | User identity
Credential + No Workspace MSI | Credential | Credential | Credential (not supported for Dataset Preview under private network) | Credential (only account key and SAS token)
No Credential + No Workspace MSI | Compute MSI/User identity | Compute MSI/User identity | User identity | User identity

With SDK v1, data authentication in a job always uses the compute MSI. With SDK v2, data authentication in a job depends on the job settings: it can use either the user identity or the compute MSI.
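
The scenario table above can be encoded as a lookup, which is useful when troubleshooting scripts need to report which identity requires permissions. This is a sketch only. The dictionary keys, scenario names, and the function identity_for are hypothetical, and the values are transcribed from the table.

```python
# Key: (datastore has a stored credential, workspace MSI is used).
# Values: the identity used in each scenario, transcribed from the table above.
IDENTITY_BY_CONFIGURATION = {
    (True, True): {
        "sdk_local": "Credential",
        "job": "Credential",
        "dataset_preview": "Workspace MSI",
        "datastore_browse": "Credential (only account key and SAS token)",
    },
    (False, True): {
        "sdk_local": "Compute MSI/User identity",
        "job": "Compute MSI/User identity",
        "dataset_preview": "Workspace MSI",
        "datastore_browse": "User identity",
    },
    (True, False): {
        "sdk_local": "Credential",
        "job": "Credential",
        "dataset_preview": "Credential (not supported under private network)",
        "datastore_browse": "Credential (only account key and SAS token)",
    },
    (False, False): {
        "sdk_local": "Compute MSI/User identity",
        "job": "Compute MSI/User identity",
        "dataset_preview": "User identity",
        "datastore_browse": "User identity",
    },
}


def identity_for(has_credential, uses_workspace_msi, scenario):
    """Look up which identity authenticates to storage for a scenario."""
    return IDENTITY_BY_CONFIGURATION[(has_credential, uses_workspace_msi)][scenario]


print(identity_for(False, True, "dataset_preview"))  # Workspace MSI
print(identity_for(False, False, "job"))             # Compute MSI/User identity
```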

Tip

Access to data from outside Azure Machine Learning, for example with Azure Storage Explorer, probably relies on the user identity. For specific information, review the documentation for the tool or service you're using. For more information about how Azure Machine Learning works with data, see Set up authentication between Azure Machine Learning and other services.

VNet-specific requirements

The following information helps you set up data authentication to access data behind a VNet from an Azure Machine Learning workspace.

Add permissions of Azure Storage Account to Azure Machine Learning workspace managed identity

When you use an Azure Storage Account from Azure Machine Learning studio and want to see Dataset Preview, you must enable "Use workspace managed identity for data preview and profiling in Azure Machine Learning studio" in the datastore settings, and assign these Azure RBAC roles on the storage account to the workspace managed identity:

  • Storage Blob Data Reader
  • If the storage account uses a private endpoint to connect to the VNet, you must grant the Reader role for the storage account private endpoint to the managed identity.

For more information, see Use Azure Machine Learning studio in an Azure Virtual Network.
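
The role requirements above depend on whether the storage account uses a private endpoint. The sketch below only computes which assignments are needed. It doesn't call Azure, and it assumes that the blob data role in the list refers to the built-in Storage Blob Data Reader role. The function name and the resource ID strings are hypothetical placeholders.

```python
def required_role_assignments(storage_account_id, private_endpoint_id=None):
    """List the role assignments the workspace managed identity needs.

    Assumption: `private_endpoint_id` is None when the storage account
    doesn't use a private endpoint to connect to the VNet.
    """
    assignments = [
        {"role": "Storage Blob Data Reader", "scope": storage_account_id},
    ]
    if private_endpoint_id is not None:
        # The Reader role is scoped to the private endpoint itself.
        assignments.append({"role": "Reader", "scope": private_endpoint_id})
    return assignments


# Placeholder resource IDs, for illustration only.
for item in required_role_assignments("<storage-account-id>", "<private-endpoint-id>"):
    print(item["role"], "on", item["scope"])
```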

The following sections explain the limitations of using an Azure Storage Account, with your workspace, in a VNet.

Secure communication with Azure Storage Account

To secure communication between Azure Machine Learning and Azure Storage Accounts, configure the storage to Grant access to trusted Azure services.

Azure Storage firewall

When an Azure Storage account is located behind a virtual network, you can normally use the storage firewall to allow your client to connect directly over the internet. However, when you use studio, your client doesn't connect to the storage account. Instead, the Azure Machine Learning service that makes the request connects to the storage account. The IP address of the service isn't documented, and it changes frequently. For this reason, enabling the storage firewall doesn't allow studio to access the storage account in a VNet configuration.

Azure Storage endpoint type

When the workspace uses a private endpoint, and the storage account is also in the VNet, extra validation requirements arise when using studio:

  • If the storage account uses a service endpoint, the workspace private endpoint and storage service endpoint must be located in the same subnet of the VNet.
  • If the storage account uses a private endpoint, the workspace private endpoint and storage private endpoint must be located in the same VNet. In this case, they can be in different subnets.
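
The two placement rules above can be expressed as a small validation function. This is a sketch under stated assumptions: the Endpoint type, its fields, and the VNet and subnet names are hypothetical and don't correspond to any Azure SDK object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    kind: str    # "private" or "service"
    vnet: str    # VNet name (placeholder)
    subnet: str  # subnet name (placeholder)


def studio_placement_ok(workspace_pe, storage_ep):
    """Check the studio placement rules for a workspace private endpoint."""
    if storage_ep.kind == "service":
        # Service endpoint: must share both the VNet and the subnet.
        return (workspace_pe.vnet, workspace_pe.subnet) == (storage_ep.vnet, storage_ep.subnet)
    if storage_ep.kind == "private":
        # Private endpoint: same VNet is enough; subnets may differ.
        return workspace_pe.vnet == storage_ep.vnet
    raise ValueError(f"unknown endpoint kind: {storage_ep.kind!r}")


workspace = Endpoint("private", "vnet-a", "subnet-1")
print(studio_placement_ok(workspace, Endpoint("service", "vnet-a", "subnet-2")))  # False
print(studio_placement_ok(workspace, Endpoint("private", "vnet-a", "subnet-2")))  # True
```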

Azure Data Lake Storage Gen1

When using Azure Data Lake Storage Gen1 as a datastore, you can only use POSIX-style access control lists. You can assign the workspace's managed identity access to resources, just like any other security principal. For more information, see Access control in Azure Data Lake Storage Gen1.

Azure Data Lake Storage Gen2

When using Azure Data Lake Storage Gen2 as a datastore, you can use both Azure RBAC and POSIX-style access control lists (ACLs) to control data access inside of a virtual network.

To use Azure RBAC, follow the steps described in the Datastore: Azure Storage Account article section. Data Lake Storage Gen2 is based on Azure Storage, so the same steps apply when using Azure RBAC.

To use ACLs, the managed identity of the workspace can be assigned access just like any other security principal. For more information, see Access control lists on files and directories.
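
When granting the workspace managed identity access through ACLs, it can help to build the ACL entry text programmatically. The sketch below assumes the short-form POSIX-style entry layout of an optional default: prefix, then user:, the object ID, and the rwx permission string. Verify the exact format against the access control documentation for your tool. The function name and the object ID are placeholders.

```python
def acl_entry(object_id, read=False, write=False, execute=False, default=False):
    """Build a POSIX-style ACL entry string for a security principal.

    Assumption: the entry layout is `[default:]user:<object-id>:<rwx>`.
    """
    permissions = (
        ("r" if read else "-")
        + ("w" if write else "-")
        + ("x" if execute else "-")
    )
    prefix = "default:" if default else ""
    return f"{prefix}user:{object_id}:{permissions}"


# Placeholder object ID for the workspace managed identity.
print(acl_entry("<managed-identity-object-id>", read=True, execute=True))
# user:<managed-identity-object-id>:r-x
```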

Next steps

For information about enabling studio in a network, see Use Azure Machine Learning studio in an Azure Virtual Network.