Use Azure Machine Learning studio in an Azure virtual network
In this article, you learn how to use Azure Machine Learning studio in a virtual network. You learn how to:
- Access the studio from a resource inside of a virtual network.
- Configure private endpoints for storage accounts.
- Give the studio access to data stored inside of a virtual network.
- Understand how the studio impacts storage security.
This article is part five of a five-part series that walks you through securing an Azure Machine Learning workflow. We highly recommend that you read through Part one: VNet overview to understand the overall architecture first.
See the other articles in this series:
If your workspace is in a sovereign cloud, such as Azure Government or Azure China 21Vianet, integrated notebooks do not support using storage that is in a virtual network. Instead, you can use Jupyter Notebooks from a compute instance. For more information, see the Access data in a Compute Instance notebook section.
Read the Network security overview to understand common virtual network scenarios and overall virtual network architecture.
A pre-existing virtual network and subnet to use.
An existing Azure storage account added to your virtual network.
Access the studio from a resource inside the VNet
If you are accessing the studio from a resource inside of a virtual network (for example, a compute instance or virtual machine), you must allow outbound traffic from the virtual network to the studio.
For example, if you are using network security groups (NSGs) to restrict outbound traffic, add an outbound rule that allows traffic to the service tag destination AzureFrontDoor.Frontend.
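As a rough sketch, the outbound rule could be described with a security rule definition like the one below, written as a Python dictionary in the shape that ARM templates use for NSG rules. Only the AzureFrontDoor.Frontend destination comes from this article. The rule name, priority, the VirtualNetwork source tag, and port 443 are assumptions for illustration, so adjust them to your own NSG.

```python
import json


def studio_outbound_rule(priority: int = 100) -> dict:
    """Build an NSG rule definition that allows outbound traffic to the studio.

    The name, priority, source tag, and port are illustrative assumptions.
    """
    return {
        "name": "AllowStudioOutbound",  # hypothetical rule name
        "properties": {
            "direction": "Outbound",
            "access": "Allow",
            "protocol": "Tcp",
            "priority": priority,  # assumption: pick a free priority in your NSG
            "sourceAddressPrefix": "VirtualNetwork",  # assumption: whole VNet
            "sourcePortRange": "*",
            "destinationAddressPrefix": "AzureFrontDoor.Frontend",
            "destinationPortRange": "443",  # assumption: HTTPS traffic
        },
    }


if __name__ == "__main__":
    # Print the rule as JSON so it can be pasted into a template.
    print(json.dumps(studio_outbound_rule(), indent=2))
```

The dictionary only describes the rule. Applying it still requires the portal, a template deployment, or a tool of your choice.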
Access data using the studio
After you add an Azure storage account to your virtual network with a service endpoint or private endpoint, you must configure your storage account to use managed identity to grant the studio access to your data.
If you do not enable managed identity, you will receive this error:
Error: Unable to profile this dataset. This might be because your data is stored behind a virtual network or your data does not support profile.
Additionally, the following operations will be disabled:
- Preview data in the studio.
- Visualize data in the designer.
- Submit an AutoML experiment.
- Start a labeling project.
The studio supports reading data from the following datastore types in a virtual network:
- Azure Blob
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure SQL Database
Grant workspace managed identity Reader access to storage private link
Configure datastores to use workspace managed identity
Azure Machine Learning uses datastores to connect to storage accounts. Use the following steps to configure your datastores to use managed identity.
In the studio, select Datastores.
To create a new datastore, select + New datastore.
To update an existing datastore, select the datastore and select Update credentials.
In the datastore settings, select Yes for Allow Azure Machine Learning service to access the storage using workspace-managed identity.
These steps add the workspace-managed identity as a Reader to the storage service using Azure role-based access control (Azure RBAC). Reader access lets the workspace retrieve firewall settings and ensure that data doesn't leave the virtual network.
These changes may take up to 10 minutes to take effect.
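If you register datastores from code rather than from the studio, a roughly equivalent setting may be available in the Python SDK. The sketch below assumes the azureml-core (SDK v1) package, its Datastore.register_azure_blob_container method, and that method's grant_workspace_access parameter. The helper names and all argument values are hypothetical, so verify the parameter against the SDK version you use.

```python
from typing import Optional


def managed_identity_blob_args(
    datastore_name: str,
    account_name: str,
    container_name: str,
    account_key: Optional[str] = None,
) -> dict:
    """Collect keyword arguments for registering a blob datastore.

    grant_workspace_access=True is assumed to be the SDK counterpart of the
    workspace-managed identity option in the studio.
    """
    args = {
        "datastore_name": datastore_name,
        "account_name": account_name,
        "container_name": container_name,
        "grant_workspace_access": True,
    }
    if account_key is not None:
        args["account_key"] = account_key
    return args


def register_blob_datastore(workspace, args: dict):
    """Register the datastore; requires azureml-core and a Workspace object."""
    from azureml.core import Datastore  # imported lazily so the helper above works without the SDK

    return Datastore.register_azure_blob_container(workspace=workspace, **args)
```

Splitting argument building from registration keeps the first helper usable without the SDK or a live workspace.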
Technical notes for managed identity
Using managed identity to access storage services impacts some security considerations. These considerations depend on the type of storage account you are accessing, so this section describes the changes for each type.
Azure Blob storage
For Azure Blob storage, the workspace-managed identity is also added as a Storage Blob Data Reader so that it can read data from blob storage.
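If you ever need to review or recreate these role assignments by hand, the sketch below builds the corresponding Azure CLI commands in Python. It assumes the built-in role names Reader and Storage Blob Data Reader and the az role assignment create command. The principal ID, subscription, resource group, and account name are placeholders, and the helper names are hypothetical.

```python
import shlex
from typing import List, Sequence


def storage_scope(subscription_id: str, resource_group: str, account_name: str) -> str:
    """Return the Azure resource ID of a storage account, used as the role scope."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_commands(
    principal_id: str,
    scope: str,
    roles: Sequence[str] = ("Reader", "Storage Blob Data Reader"),
) -> List[List[str]]:
    """Build one 'az role assignment create' argument list per role."""
    return [
        ["az", "role", "assignment", "create",
         "--assignee", principal_id,
         "--role", role,
         "--scope", scope]
        for role in roles
    ]


if __name__ == "__main__":
    scope = storage_scope("<subscription-id>", "<resource-group>", "<storage-account>")
    for cmd in role_assignment_commands("<workspace-identity-object-id>", scope):
        # Print shell-quoted commands for review instead of executing them.
        print(shlex.join(cmd))
```

The script only prints the commands, so you can review them before running anything against your subscription.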
Azure Data Lake Storage Gen2 access control
You can use both Azure RBAC and POSIX-style access control lists (ACLs) to control data access inside of a virtual network.
To use ACLs, the workspace-managed identity can be assigned access just like any other security principal. For more information, see Access control lists on files and directories.
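To make the ACL option more concrete, the sketch below builds a POSIX-style ACL entry in the short form user:<object-id>:<permissions>. The helper name and the object ID are hypothetical. The resulting string is only the entry text, and you would still pass it to whichever tool or SDK call you use to set ACLs.

```python
def acl_entry(object_id: str, permissions: str = "r-x", default: bool = False) -> str:
    """Build a short-form ACL entry for a named user (here, a managed identity).

    permissions must be three characters, each either the expected letter
    from "rwx" at that position or "-".
    """
    if len(permissions) != 3 or any(
        ch not in (expected, "-") for ch, expected in zip(permissions, "rwx")
    ):
        raise ValueError(f"invalid permission string: {permissions!r}")
    prefix = "default:" if default else ""
    return f"{prefix}user:{object_id}:{permissions}"


if __name__ == "__main__":
    identity = "00000000-0000-0000-0000-000000000000"  # placeholder object ID
    print(acl_entry(identity))                # access entry on an existing item
    print(acl_entry(identity, default=True))  # default entry inherited by new child items
```

Read and execute (r-x) is used as the default here on the assumption that the workspace needs to list directories and read files. Grant less if your scenario allows it.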
Azure Data Lake Storage Gen1 access control
Azure Data Lake Storage Gen1 only supports POSIX-style access control lists. You can assign the workspace-managed identity access to resources just like any other security principal. For more information, see Access control in Azure Data Lake Storage Gen1.
Azure SQL Database contained user
To access data stored in an Azure SQL Database using managed identity, you must create a SQL contained user that maps to the managed identity. For more information on creating a user from an external provider, see Create contained users mapped to Azure AD identities.
After you create a SQL contained user, grant permissions to it by using the GRANT T-SQL command.
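A minimal sketch of these two steps is shown below. It generates the T-SQL statements as strings, so you can review them before running them with your usual SQL client. It assumes the contained user is named after the workspace whose managed identity it maps to. The workspace name, schema, table, and helper names are hypothetical, and SELECT is only one example of a permission you might grant.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Quote a T-SQL identifier with brackets, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def contained_user_statements(
    identity_name: str, schema: str = "dbo", table: str = "MyTable"
) -> List[str]:
    """Return T-SQL that creates the contained user and grants it read access."""
    user = quote_ident(identity_name)
    target = f"{quote_ident(schema)}.{quote_ident(table)}"
    return [
        # Map a contained database user to the externally provided identity.
        f"CREATE USER {user} FROM EXTERNAL PROVIDER;",
        # Grant read access on a single table (assumption: SELECT is enough).
        f"GRANT SELECT ON OBJECT::{target} TO {user};",
    ]


if __name__ == "__main__":
    for statement in contained_user_statements("my-workspace"):
        print(statement)
```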
Azure Machine Learning designer default datastore
By default, the designer stores output in the storage account attached to your workspace. However, you can configure it to store output in any datastore that you have access to. If your environment uses virtual networks, you can use these controls to ensure your data remains secure and accessible.
To set a new default storage for a pipeline:
- In a pipeline draft, select the Settings gear icon near the title of your pipeline.
- Choose the Select default datastore option.
- Specify a new datastore.
You can also override the default datastore on a per-module basis. This gives you control over the storage location for each individual module.
- Select the module whose output you want to specify.
- Expand the Output settings section.
- Select Override default output settings.
- Select Set output settings.
- Specify a new datastore.
This article is part of a series on securing an Azure Machine Learning workflow with virtual networks. See the rest of the articles to learn how to secure a virtual network: