Secure an Azure Machine Learning training environment with virtual networks

In this article, you learn how to secure training environments with a virtual network in Azure Machine Learning.

Tip

This article is part of a series on securing an Azure Machine Learning workflow. See the other articles in this series.

For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or Tutorial: Create a secure workspace using a template.

In this article, you learn how to secure the following training compute resources in a virtual network:

  • Azure Machine Learning compute cluster
  • Azure Machine Learning compute instance
  • Azure Databricks
  • Virtual Machine
  • HDInsight cluster

Prerequisites

  • Read the Network security overview article to understand common virtual network scenarios and overall virtual network architecture.

  • An existing virtual network and subnet to use with your compute resources.

  • To deploy resources into a virtual network or subnet, your user account must have permissions for the following actions in Azure role-based access control (Azure RBAC):

    • "Microsoft.Network/virtualNetworks/*/read" on the virtual network resource. This permission isn't needed for Azure Resource Manager (ARM) template deployments.
    • "Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet resource.

    For more information on Azure RBAC with networking, see the Networking built-in roles article.

Azure Machine Learning compute cluster/instance

  • The virtual network must be in the same subscription as the Azure Machine Learning workspace.

  • The subnet used for the compute instance or cluster must have enough unassigned IP addresses.

    • A compute cluster can dynamically scale. If there aren't enough unassigned IP addresses, the cluster will be partially allocated.
    • A compute instance only requires one IP address.
  • To create a compute cluster or instance without a public IP address (a preview feature), your workspace must use a private endpoint to connect to the VNet. For more information, see Configure a private endpoint for Azure Machine Learning workspace.

  • Make sure that there are no security policies or locks on the virtual network's subscription or resource group that restrict permissions to manage the virtual network.

  • If you plan to secure the virtual network by restricting traffic, see the Required public internet access section.

  • The subnet used to deploy a compute cluster or instance shouldn't be delegated to any other service. For example, it shouldn't be delegated to Azure Container Instances (ACI).

Azure Databricks

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Databricks cluster.

Limitations

Azure Machine Learning compute cluster/instance

  • If you put multiple compute instances or clusters in one virtual network, you may need to request a quota increase for one or more of your resources. The Machine Learning compute instance or cluster automatically allocates additional networking resources in the resource group that contains the virtual network. For each compute instance or cluster, the service allocates the following resources:

    • One network security group (NSG). This NSG contains the following rules, which are specific to compute cluster and compute instance:

      • Allow inbound TCP traffic on ports 29876-29877 from the BatchNodeManagement service tag.
      • Allow inbound TCP traffic on port 44224 from the AzureMachineLearning service tag.

      The following screenshot shows an example of these rules:

      Screenshot of NSG

      Tip

      If your compute cluster or instance does not use a public IP address (a preview feature), these inbound NSG rules are not required.

    • For a compute cluster or instance, it is now possible to remove the public IP address (a preview feature). If you have Azure Policy assignments that prohibit public IP creation, deployment of a compute cluster or instance without a public IP will still succeed.

    • One load balancer

    For compute clusters, these resources are deleted every time the cluster scales down to 0 nodes and created when scaling up.

    For a compute instance, these resources are kept until the instance is deleted. Stopping the instance does not remove the resources.

    Important

    These resources are limited by the subscription's resource quotas. If the virtual network's resource group is locked, deletion of the compute cluster or instance will fail. The load balancer can't be deleted until the compute cluster or instance is deleted. Also make sure there is no Azure Policy assignment that prohibits creation of network security groups.

  • If the Azure Storage Accounts for the workspace are also in the virtual network, use the following guidance on subnet limitations:

    • If you plan to use Azure Machine Learning studio to visualize data or use designer, the storage account must be in the same subnet as the compute instance or cluster.
    • If you plan to use the SDK, the storage account can be in a different subnet.

    Note

    Adding a resource instance for your workspace or selecting the checkbox for "Allow trusted Microsoft services to access this account" is not sufficient to allow communication from the compute.

  • When your workspace uses a private endpoint, the compute instance can only be accessed from inside the virtual network. If you use a custom DNS or hosts file, add an entry for <instance-name>.<region>.instances.azureml.ms. Map this entry to the private IP address of the workspace private endpoint. For more information, see the custom DNS article.

  • Virtual network service endpoint policies don't work for compute cluster/instance system storage accounts.

  • If storage and compute instance are in different regions, you may see intermittent timeouts.

  • If the Azure Container Registry for your workspace uses a private endpoint to connect to the virtual network, you cannot use a managed identity for the compute instance. To use a managed identity with the compute instance, do not put the container registry in the VNet.

  • If you want to use Jupyter Notebooks on a compute instance:

    • Don't disable websocket communication. Make sure your network allows websocket communication to *.instances.azureml.net and *.instances.azureml.ms.
    • Make sure that your notebook is running on a compute resource in the same virtual network and subnet as your data. When creating the compute instance, use Advanced settings > Configure virtual network to select the network and subnet.
  • Compute clusters can be created in a different region than your workspace. This functionality is in preview, and is only available for compute clusters, not compute instances. When using a different region for the cluster, the following limitations apply:

    • If your workspace associated resources, such as storage, are in a different virtual network than the cluster, set up global virtual network peering between the networks. For more information, see Virtual network peering.
    • You may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.

    Guidance such as NSG rules, user-defined routes, and inbound/outbound requirements applies as normal when using a region different from the workspace.

    Warning

    If you are using a private endpoint-enabled workspace, creating the cluster in a different region is not supported.

Azure Databricks

  • In addition to the databricks-private and databricks-public subnets used by Azure Databricks, the default subnet created for the virtual network is also required.
  • Azure Databricks does not use a private endpoint to communicate with the virtual network.

For more information on using Azure Databricks in a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Azure HDInsight or virtual machine

  • Azure Machine Learning supports only virtual machines that are running Ubuntu.

Required public internet access

Azure Machine Learning requires both inbound and outbound access to the public internet. The following tables provide an overview of what access is required and what it is for. The protocol for all items is TCP. For service tags that end in .region, replace region with the Azure region that contains your workspace. For example, Storage.westus:

Direction Ports Service tag Purpose
Inbound 29876-29877 BatchNodeManagement Create, update, and delete of Azure Machine Learning compute instance and compute cluster.
Inbound 44224 AzureMachineLearning Create, update, and delete of Azure Machine Learning compute instance.
Outbound 443 AzureMonitor Used to log monitoring and metrics to App Insights and Azure Monitor.
Outbound 80, 443 AzureActiveDirectory Authentication using Azure AD.
Outbound 443 AzureMachineLearning Using Azure Machine Learning services.
Outbound 443 AzureResourceManager Creation of Azure resources with Azure Machine Learning.
Outbound 443 Storage.region Access data stored in the Azure Storage Account for the Azure Batch service.
Outbound 443 AzureFrontDoor.FrontEnd* Global entry point for Azure Machine Learning studio. (*Not needed in Azure China.)
Outbound 443 ContainerRegistry.region Access docker images provided by Microsoft.
Outbound 443 MicrosoftContainerRegistry.region Access docker images provided by Microsoft. Setup of the Azure Machine Learning router for Azure Kubernetes Service.
Outbound 443 Keyvault.region Access the key vault for the Azure Batch service. Only needed if your workspace was created with the hbi_workspace flag enabled.

Tip

If you need the IP addresses instead of service tags, download the Azure IP Ranges and Service Tags file and look up the service tags listed above for your region.

The IP addresses may change periodically.

You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft sites for the installation of packages required by your machine learning project. The following table lists commonly used repositories for machine learning:

Host name Purpose
anaconda.com, *.anaconda.com Used to install default packages.
*.anaconda.org Used to get repo data.
pypi.org Used to list dependencies from the default index, if any, and the index is not overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org.
cloud.r-project.org Used when installing CRAN packages for R development.
*.pytorch.org Used by some examples based on PyTorch.
*.tensorflow.org Used by some examples based on TensorFlow.
update.code.visualstudio.com, *.vo.msecnd.net Used to retrieve VS Code server bits, which are installed on the compute instance through a setup script.
raw.githubusercontent.com/microsoft/vscode-tools-for-ai/master/azureml_remote_websocket_server/* Used to retrieve websocket server bits, which are installed on the compute instance. The websocket server is used to transmit requests from Visual Studio Code client (desktop application) to Visual Studio Code server running on the compute instance.

When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the traffic required by AKS to reach the AKS VNet.

For information on using a firewall solution, see Use a firewall with Azure Machine Learning.

Compute clusters

Use the following steps to create a compute cluster in the Azure Machine Learning studio:

  1. Sign in to Azure Machine Learning studio, and then select your subscription and workspace.

  2. Select Compute on the left, Compute clusters from the center, and then select + New.

    Screenshot of creating a cluster

  3. In the Create compute cluster dialog, select the VM size and configuration you need and then select Next.

    Screenshot of setting VM config

  4. From the Configure Settings section, set the Compute name, Virtual network, and Subnet.

    Screenshot shows setting compute name, virtual network, and subnet.

    Tip

    If your workspace uses a private endpoint to connect to the virtual network, the Virtual network selection field is greyed out.

  5. Select Create to create the compute cluster.

When the creation process finishes, you train your model by using the cluster in an experiment. For more information, see Select and use a compute target for training.
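
If you prefer to create the cluster with the Azure Machine Learning Python SDK (v1) instead of the studio, the following minimal sketch shows the same idea. The workspace configuration, cluster name, and virtual network, resource group, and subnet names are placeholders you'd replace with your own values.

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Connect to the workspace (assumes a config.json downloaded from the portal).
ws = Workspace.from_config()

# Provision the cluster into an existing virtual network and subnet.
# The vnet_resourcegroup_name/vnet_name/subnet_name values are placeholders.
vnet_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2",
    min_nodes=0,
    max_nodes=4,
    vnet_resourcegroup_name="<vnet-resource-group>",
    vnet_name="<vnet-name>",
    subnet_name="<subnet-name>",
)

cluster = ComputeTarget.create(ws, "cpu-cluster", vnet_config)
cluster.wait_for_completion(show_output=True)
```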

Note

You may choose to use low-priority VMs to run some or all of your workloads. See how to create a low-priority VM.

No public IP for compute clusters (preview)

When you enable No public IP, your compute cluster doesn't use a public IP for communication with any dependencies. Instead, it communicates solely within the virtual network using the Azure Private Link ecosystem and service/private endpoints, eliminating the need for a public IP entirely. No public IP removes access and discoverability of compute cluster nodes from the internet, eliminating a significant threat vector. It also helps you comply with the no-public-IP policies many enterprises have.

A compute cluster with No public IP enabled has no inbound communication requirements from the public internet, unlike a cluster with a public IP. Specifically, neither inbound NSG rule (BatchNodeManagement, AzureMachineLearning) is required. You still need to allow inbound traffic from a source of VirtualNetwork (any source port) to a destination of VirtualNetwork on destination ports 29876 and 29877.

No public IP clusters are dependent on Azure Private Link for the Azure Machine Learning workspace. A compute cluster with No public IP also requires you to disable private endpoint network policies and private link service network policies. These requirements come from Azure Private Link service and private endpoints and aren't specific to Azure Machine Learning. Follow the instructions from Disable network policies for Private Link service to set the disable-private-endpoint-network-policies and disable-private-link-service-network-policies parameters on the virtual network subnet.
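
As a sketch of what that configuration looks like programmatically, the following example uses the azure-mgmt-network package to disable both policies on a subnet. The subscription ID, resource group, virtual network, and subnet names are placeholders; you can also make the same change with the Azure CLI parameters named above.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Placeholders - replace with your own values.
subscription_id = "<subscription-id>"
resource_group = "<vnet-resource-group>"
vnet_name = "<vnet-name>"
subnet_name = "<subnet-name>"

client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

# Read the current subnet, turn off both network policies, and write it back.
subnet = client.subnets.get(resource_group, vnet_name, subnet_name)
subnet.private_endpoint_network_policies = "Disabled"
subnet.private_link_service_network_policies = "Disabled"
client.subnets.begin_create_or_update(
    resource_group, vnet_name, subnet_name, subnet
).result()
```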

For outbound connections to work, you need to set up an egress firewall such as Azure Firewall with user-defined routes. For instance, you can use a firewall set up with inbound/outbound configuration and route traffic to it by defining a route table on the subnet in which the compute cluster is deployed. The route table entry can set the next hop to the private IP address of the firewall with the address prefix of 0.0.0.0/0.
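
The following sketch shows one way to create such a route with the azure-mgmt-network package, assuming a route table that you then associate with the compute subnet and a firewall whose private IP address you already know; all names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) a route table in the region of the compute subnet.
client.route_tables.begin_create_or_update(
    "<vnet-resource-group>",
    "aml-compute-routes",
    {"location": "<region>"},
).result()

# Route all traffic (0.0.0.0/0) to the firewall's private IP address.
client.routes.begin_create_or_update(
    "<vnet-resource-group>",
    "aml-compute-routes",
    "default-to-firewall",
    {
        "address_prefix": "0.0.0.0/0",
        "next_hop_type": "VirtualAppliance",
        "next_hop_ip_address": "<firewall-private-ip>",
    },
).result()
```

After the route table exists, associate it with the subnet in which the compute cluster is deployed.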

You can use a service endpoint or private endpoint for your Azure Container Registry and Azure Storage in the subnet in which the cluster is deployed.

To create a no public IP address compute cluster (a preview feature) in studio, select the No public IP checkbox in the virtual network section. You can also create a no public IP compute cluster through an ARM template. In the ARM template, set the enableNodePublicIP parameter to false.
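
If you use the Python SDK (v1) instead of studio or an ARM template, recent versions of AmlCompute.provisioning_configuration expose an enable_node_public_ip parameter that maps to the same setting. A minimal sketch, assuming the workspace already connects to the virtual network through a private endpoint and using placeholder network names:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Same virtual network settings as before, plus enable_node_public_ip=False
# to request cluster nodes without public IP addresses (preview).
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2",
    max_nodes=4,
    vnet_resourcegroup_name="<vnet-resource-group>",
    vnet_name="<vnet-name>",
    subnet_name="<subnet-name>",
    enable_node_public_ip=False,
)

cluster = ComputeTarget.create(ws, "no-public-ip-cluster", config)
cluster.wait_for_completion(show_output=True)
```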

Note

Support for compute instances without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, East US 2, North Europe, West Europe, Central US, North Central US, West US, Australia East, Japan East, Japan West.

Support for compute clusters without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, North Europe, East US 2, Central US, West Europe, North Central US, West US, Australia East, Japan East, Japan West.

Troubleshooting

  • If you get the error message "The specified subnet has PrivateLinkServiceNetworkPolicies or PrivateEndpointNetworkEndpoints enabled" during cluster creation, follow the instructions from Disable network policies for Private Link service and Disable network policies for Private Endpoint.

  • If job execution fails with connection issues to ACR or Azure Storage, verify that service endpoints or private endpoints for ACR and Azure Storage have been added to the subnet, and that ACR and Azure Storage allow access from the subnet.

  • To confirm that you have created a no public IP cluster, view the cluster details in studio; the No Public IP property is set to true under resource properties.

Compute instance

For steps on how to create a compute instance deployed in a virtual network, see Create and manage an Azure Machine Learning compute instance.
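
For reference, the following minimal sketch deploys a compute instance into an existing subnet using the Python SDK (v1); the network names are placeholders, and the studio and ARM template approaches described in the linked article work equally well.

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()

# Deploy the compute instance into an existing virtual network and subnet.
config = ComputeInstance.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    ssh_public_access=False,
    vnet_resourcegroup_name="<vnet-resource-group>",
    vnet_name="<vnet-name>",
    subnet_name="<subnet-name>",
)

instance = ComputeTarget.create(ws, "my-instance", config)
instance.wait_for_completion(show_output=True)
```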

No public IP for compute instances (preview)

When you enable No public IP, your compute instance doesn't use a public IP for communication with any dependencies. Instead, it communicates solely within the virtual network using the Azure Private Link ecosystem and service/private endpoints, eliminating the need for a public IP entirely. No public IP removes access and discoverability of the compute instance node from the internet, eliminating a significant threat vector. Compute instances also do packet filtering to reject any traffic from outside the virtual network. No public IP instances are dependent on Azure Private Link for the Azure Machine Learning workspace.

For outbound connections to work, you need to set up an egress firewall such as Azure Firewall with user-defined routes. For instance, you can use a firewall set up with inbound/outbound configuration and route traffic to it by defining a route table on the subnet in which the compute instance is deployed. The route table entry can set the next hop to the private IP address of the firewall with the address prefix of 0.0.0.0/0.

A compute instance with No public IP enabled has no inbound communication requirements from the public internet, unlike a compute instance with a public IP. Specifically, neither inbound NSG rule (BatchNodeManagement, AzureMachineLearning) is required. You still need to allow inbound traffic from a source of VirtualNetwork (any source port) to a destination of VirtualNetwork on destination ports 29876, 29877, and 44224.

A compute instance with No public IP also requires you to disable private endpoint network policies and private link service network policies. These requirements come from Azure Private Link service and private endpoints and aren't specific to Azure Machine Learning. Follow the instructions from Disable network policies for Private Link service to set the disable-private-endpoint-network-policies and disable-private-link-service-network-policies parameters on the virtual network subnet.

To create a no public IP address compute instance (a preview feature) in studio, select the No public IP checkbox in the virtual network section. You can also create a no public IP compute instance through an ARM template. In the ARM template, set the enableNodePublicIP parameter to false.
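
If you create the instance with the Python SDK (v1), newer azureml-core releases also expose an enable_node_public_ip parameter on ComputeInstance.provisioning_configuration; this is an assumption about your SDK version, so verify the parameter exists before relying on it. A minimal sketch with placeholder names:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()

# enable_node_public_ip=False requests a compute instance without a public IP (preview).
# The parameter is only present in recent azureml-core versions.
config = ComputeInstance.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    vnet_resourcegroup_name="<vnet-resource-group>",
    vnet_name="<vnet-name>",
    subnet_name="<subnet-name>",
    enable_node_public_ip=False,
)

instance = ComputeTarget.create(ws, "no-public-ip-instance", config)
instance.wait_for_completion(show_output=True)
```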

Note

Support for compute instances without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, East US 2, North Europe, West Europe, Central US, North Central US, West US, Australia East, Japan East, Japan West.

Support for compute clusters without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, North Europe, East US 2, Central US, West Europe, North Central US, West US, Australia East, Japan East, Japan West.

Inbound traffic

When using an Azure Machine Learning compute instance (with a public IP) or compute cluster, allow inbound traffic from the Azure Batch management and Azure Machine Learning services. A compute instance with no public IP (preview) doesn't require this inbound communication. A network security group allowing this traffic is dynamically created for you; however, you may also need to create user-defined routes (UDR) if you have a firewall. When creating a UDR for this traffic, you can use either IP addresses or service tags to route the traffic.

Important

Using service tags with user-defined routes is currently in preview and may not be fully supported. For more information, see Virtual Network routing.

Tip

While a compute instance without a public IP (a preview feature) does not need a UDR for this inbound traffic, you will still need these UDRs if you also use a compute cluster or a compute instance with a public IP.

For the Azure Machine Learning service, you must add the IP addresses of both the primary and secondary regions. To find the secondary region, see Cross-region replication in Azure. For example, if your Azure Machine Learning service is in East US 2, the secondary region is Central US.

To get a list of IP addresses of the Batch service and Azure Machine Learning service, download the Azure IP Ranges and Service Tags and search the file for BatchNodeManagement.<region> and AzureMachineLearning.<region>, where <region> is your Azure region.
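
As a rough sketch of how you might pull those prefixes out of the downloaded JSON file, the following example assumes the file name, region value, and tag naming shown here (the file format can change over time):

```python
import json

# Path to the downloaded Azure IP Ranges and Service Tags file (placeholder name).
with open("ServiceTags_Public.json") as f:
    service_tags = json.load(f)

region = "eastus2"  # placeholder: your workspace's Azure region

# Print the address prefixes for the Batch and Azure Machine Learning tags.
for tag in service_tags["values"]:
    name = tag["name"].lower()
    if name in (f"batchnodemanagement.{region}", f"azuremachinelearning.{region}"):
        print(tag["name"])
        for prefix in tag["properties"]["addressPrefixes"]:
            print("  ", prefix)
```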

Important

The IP addresses may change over time.

When creating the UDR, set the Next hop type to Internet. The following image shows an example IP address based UDR in the Azure portal:

Image of a user-defined route configuration

For information on configuring UDR, see Route network traffic with a routing table.

For more information on inbound and outbound traffic requirements for Azure Machine Learning, see Use a workspace behind a firewall.

Azure Databricks

For specific information on using Azure Databricks with a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Virtual machine or HDInsight cluster

In this section, you learn how to use a virtual machine or Azure HDInsight cluster in a virtual network with your workspace.

Create the VM or HDInsight cluster

Create a VM or HDInsight cluster by using the Azure portal or the Azure CLI, and put the cluster in an Azure virtual network. For more information, see the following articles:

Configure network ports

To allow Azure Machine Learning to communicate with the SSH port on the VM or cluster, configure a source entry for the network security group. The SSH port is usually port 22. To allow traffic from this source, do the following actions:

  1. In the Source drop-down list, select Service Tag.

  2. In the Source service tag drop-down list, select AzureMachineLearning.

    Inbound rules for doing experimentation on a VM or HDInsight cluster within a virtual network

  3. In the Source port ranges drop-down list, select *.

  4. In the Destination drop-down list, select Any.

  5. In the Destination port ranges drop-down list, select 22.

  6. Under Protocol, select Any.

  7. Under Action, select Allow.
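
The same rule can also be created programmatically. The following sketch uses the azure-mgmt-network package and mirrors the portal steps above (AzureMachineLearning service tag to port 22, protocol Any); the subscription ID, resource group, NSG name, rule name, and priority are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Allow inbound SSH (port 22) from the AzureMachineLearning service tag.
client.security_rules.begin_create_or_update(
    "<nsg-resource-group>",
    "<nsg-name>",
    "AllowAzureMLSSH",
    {
        "protocol": "*",
        "direction": "Inbound",
        "access": "Allow",
        "priority": 100,
        "source_address_prefix": "AzureMachineLearning",
        "source_port_range": "*",
        "destination_address_prefix": "*",
        "destination_port_range": "22",
    },
).result()
```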

Keep the default outbound rules for the network security group. For more information, see the default security rules in Security groups.

If you don't want to use the default outbound rules and you do want to limit the outbound access of your virtual network, see the required public internet access section.

Attach the VM or HDInsight cluster

Attach the VM or HDInsight cluster to your Azure Machine Learning workspace. For more information, see Set up compute targets for model training.
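
For reference, here's a minimal sketch of attaching an existing VM over SSH with the Python SDK (v1); the address, credentials, and names are placeholders, and HDInsightCompute.attach_configuration follows the same pattern for HDInsight clusters.

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, RemoteCompute

ws = Workspace.from_config()

# Attach an existing VM in the virtual network by its private address or FQDN.
attach_config = RemoteCompute.attach_configuration(
    address="<private-ip-or-fqdn>",
    ssh_port=22,
    username="<admin-username>",
    private_key_file="<path-to-ssh-private-key>",
)

vm_target = ComputeTarget.attach(ws, "my-vnet-vm", attach_config)
vm_target.wait_for_completion(show_output=True)
```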

Next steps

This article is part of a series on securing an Azure Machine Learning workflow. See the other articles in this series.