Secure an Azure Machine Learning training environment with virtual networks

In this article, you learn how to secure training environments with a virtual network in Azure Machine Learning.

This article is part three of a five-part series that walks you through securing an Azure Machine Learning workflow. We highly recommend that you read through Part one: VNet overview to understand the overall architecture first.

See the other articles in this series:

1. VNet overview > 2. Secure the workspace > 3. Secure the training environment > 4. Secure the inferencing environment > 5. Enable studio functionality

In this article you learn how to secure the following training compute resources in a virtual network:

  • Azure Machine Learning compute cluster
  • Azure Machine Learning compute instance
  • Azure Databricks
  • Virtual Machine
  • HDInsight cluster

Prerequisites

  • Read the Network security overview article to understand common virtual network scenarios and overall virtual network architecture.

  • An existing virtual network and subnet to use with your compute resources.

  • To deploy resources into a virtual network or subnet, your user account must have permissions to the following actions in Azure role-based access control (Azure RBAC):

    • "Microsoft.Network/virtualNetworks/join/action" on the virtual network resource.
    • "Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet resource.

    For more information on Azure RBAC with networking, see Networking built-in roles.

Compute clusters & instances

To use either a managed Azure Machine Learning compute target or an Azure Machine Learning compute instance in a virtual network, the following network requirements must be met:

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • The subnet that's specified for the compute instance or cluster must have enough unassigned IP addresses to accommodate the number of VMs that are targeted. If the subnet doesn't have enough unassigned IP addresses, a compute cluster will be partially allocated.
  • Check to see whether your security policies or locks on the virtual network's subscription or resource group restrict permissions to manage the virtual network. If you plan to secure the virtual network by restricting traffic, leave some ports open for the compute service. For more information, see the Required ports section.
  • If you're going to put multiple compute instances or clusters in one virtual network, you might need to request a quota increase for one or more of your resources.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Machine Learning compute instance or cluster.
  • For compute instance Jupyter functionality to work, ensure that WebSocket communication is not disabled. Your network must allow WebSocket connections to *.instances.azureml.net and *.instances.azureml.ms.
  • When a compute instance is deployed in a private link workspace, it can only be accessed from within the virtual network. If you are using custom DNS or a hosts file, add an entry for <instance-name>.<region>.instances.azureml.ms with the private IP address of the workspace private endpoint. For more information, see the custom DNS article.
  • The subnet used to deploy the compute cluster or instance must not be delegated to any other service, such as ACI.
  • Virtual network service endpoint policies do not work for compute cluster or instance system storage accounts.
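The IP address requirement above can be checked with simple arithmetic before you create a cluster. The following is a minimal sketch that uses Python's standard ipaddress module. The function names and the example address ranges are hypothetical, and the sketch assumes that Azure reserves five addresses in every subnet. It does not query Azure, so you supply the number of addresses already in use yourself.

```python
import ipaddress

# Assumption: Azure reserves five IP addresses in every subnet.
AZURE_RESERVED_ADDRESSES = 5


def unassigned_addresses(subnet_cidr: str, addresses_in_use: int) -> int:
    """Return how many addresses in the subnet are still free for new VMs."""
    network = ipaddress.ip_network(subnet_cidr)
    available = network.num_addresses - AZURE_RESERVED_ADDRESSES - addresses_in_use
    return max(available, 0)


def subnet_can_host(subnet_cidr: str, addresses_in_use: int, max_nodes: int) -> bool:
    """Check whether a cluster that scales up to max_nodes fits in the subnet."""
    return unassigned_addresses(subnet_cidr, addresses_in_use) >= max_nodes


# Hypothetical example: a /24 subnet with 10 addresses already assigned.
print(unassigned_addresses("10.0.0.0/24", 10))
print(subnet_can_host("10.0.0.0/24", 10, max_nodes=4))
```

Sizing against the cluster's maximum node count, rather than its current size, helps avoid the partial allocation described above.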

Tip

The Machine Learning compute instance or cluster automatically allocates additional networking resources in the resource group that contains the virtual network. For each compute instance or cluster, the service allocates the following resources:

  • One network security group
  • One public IP address. If an Azure policy prohibits public IP creation, deployment of the cluster or instance fails.
  • One load balancer

For clusters, these resources are deleted every time the cluster scales down to 0 nodes and created again when it scales up. For an instance, the resources are kept until the instance is completely deleted; stopping the instance does not remove them. These resources are limited by the subscription's resource quotas. If the virtual network resource group is locked, deletion of the compute cluster or instance fails. The load balancer cannot be deleted until the compute cluster or instance is deleted. Also ensure that no Azure policy prohibits the creation of network security groups.

Required ports

If you plan on securing the virtual network by restricting network traffic to/from the public internet, you must allow inbound communications from the Azure Batch service.

The Batch service adds network security groups (NSGs) at the level of network interfaces (NICs) that are attached to VMs. These NSGs automatically configure inbound and outbound rules to allow the following traffic:

  • Inbound TCP traffic on ports 29876 and 29877 from a Service Tag of BatchNodeManagement. Traffic over these ports is encrypted and is used by Azure Batch for scheduler/node communication.

    An inbound rule that uses the BatchNodeManagement service tag

  • (Optional) Inbound TCP traffic on port 22 to permit remote access. Use this port only if you want to connect by using SSH on the public IP.

  • Outbound traffic on any port to the virtual network.

  • Outbound traffic on any port to the internet.

  • For a compute instance, inbound TCP traffic on port 44224 from a Service Tag of AzureMachineLearning. Traffic over this port is encrypted and is used by Azure Machine Learning for communication with applications running on compute instances.
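If you maintain your own subnet-level NSG or firewall, it can help to compare its allow rules against the required inbound traffic listed above. The following is a minimal sketch of that comparison. The InboundRule class and the missing_inbound_rules function are hypothetical names introduced for illustration, not part of any Azure SDK, and the optional SSH rule on port 22 is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    """One inbound allow rule, reduced to the fields discussed above."""
    source_service_tag: str
    protocol: str
    port: int


# Required for both compute clusters and compute instances.
REQUIRED_CLUSTER_RULES = frozenset({
    InboundRule("BatchNodeManagement", "TCP", 29876),
    InboundRule("BatchNodeManagement", "TCP", 29877),
})

# A compute instance additionally needs port 44224 from AzureMachineLearning.
REQUIRED_INSTANCE_RULES = REQUIRED_CLUSTER_RULES | {
    InboundRule("AzureMachineLearning", "TCP", 44224),
}


def missing_inbound_rules(allowed, compute_instance=False):
    """Return the required rules that are absent from the allowed rules."""
    required = REQUIRED_INSTANCE_RULES if compute_instance else REQUIRED_CLUSTER_RULES
    return sorted(required - set(allowed), key=lambda rule: rule.port)


# Hypothetical NSG that only allows the Batch ports.
allowed_rules = [
    InboundRule("BatchNodeManagement", "TCP", 29876),
    InboundRule("BatchNodeManagement", "TCP", 29877),
]
print(missing_inbound_rules(allowed_rules, compute_instance=True))
```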

Important

Exercise caution if you modify or add inbound or outbound rules in Batch-configured NSGs. If an NSG blocks communication to the compute nodes, the compute service sets the state of the compute nodes to unusable.

You don't need to specify NSGs at the subnet level, because the Azure Batch service configures its own NSGs. However, if the subnet that contains the Azure Machine Learning compute has associated NSGs or a firewall, you must also allow the traffic listed earlier.

The NSG rule configuration in the Azure portal is shown in the following images:

The inbound NSG rules for Machine Learning Compute

Inbound NSG rules for Machine Learning Compute

Limit outbound connectivity from the virtual network

If you don't want to use the default outbound rules and you do want to limit the outbound access of your virtual network, use the following steps:

  • Deny outbound internet connection by using the NSG rules.

  • For a compute instance or a compute cluster, limit outbound traffic to the following items:

    • Azure Storage, by using the Service Tag Storage.RegionName, where RegionName is the name of an Azure region.
    • Azure Container Registry, by using the Service Tag AzureContainerRegistry.RegionName, where RegionName is the name of an Azure region.
    • Azure Machine Learning, by using the Service Tag AzureMachineLearning.
    • Azure Resource Manager, by using the Service Tag AzureResourceManager.
    • Azure Active Directory, by using the Service Tag AzureActiveDirectory.
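The list of outbound destinations above can be assembled programmatically when you script your NSG rules. The following is a minimal sketch. The outbound_service_tags function and its default_docker_images parameter are hypothetical, and the region name format ("EastUS2") is an assumption you should confirm against the Service Tag names shown in the Azure portal. The two extra tags apply only when you use the default Docker images provided by Microsoft with user-managed dependencies.

```python
def outbound_service_tags(region_name, default_docker_images=False):
    """Return the Service Tags to allow for outbound traffic from the compute subnet."""
    tags = [
        f"Storage.{region_name}",
        f"AzureContainerRegistry.{region_name}",
        "AzureMachineLearning",
        "AzureResourceManager",
        "AzureActiveDirectory",
    ]
    if default_docker_images:
        # Needed for Microsoft-provided default Docker images with user-managed dependencies.
        tags += ["MicrosoftContainerRegistry", "AzureFrontDoor.FirstParty"]
    return tags


for tag in outbound_service_tags("EastUS2", default_docker_images=True):
    print(tag)
```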

The NSG rule configuration in the Azure portal is shown in the following image:

The outbound NSG rules for Machine Learning Compute

Note

If you plan on using default Docker images provided by Microsoft, and enabling user managed dependencies, you must also use the following Service Tags:

  • MicrosoftContainerRegistry
  • AzureFrontDoor.FirstParty

This configuration is needed when you have code similar to the following snippets as part of your training scripts:

RunConfig training

from azureml.core.runconfig import RunConfiguration, DEFAULT_CPU_IMAGE

# create a new runconfig object
run_config = RunConfiguration()

# configure Docker 
run_config.environment.docker.enabled = True
# For GPU, use DEFAULT_GPU_IMAGE
run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE 
run_config.environment.python.user_managed_dependencies = True

Estimator training

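from azureml.train.estimator import Estimator

# script_params (a dict of script arguments) and exp (an Experiment) are assumed to be defined earlier
est = Estimator(source_directory='.',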
                script_params=script_params,
                compute_target='local',
                entry_script='dummy_train.py',
                user_managed=True)
run = exp.submit(est)

Forced tunneling

If you're using forced tunneling with Azure Machine Learning compute, you must allow communication with the public internet from the subnet that contains the compute resource. This communication is used for task scheduling and accessing Azure Storage.

There are two ways that you can accomplish this:

  • Use a Virtual Network NAT. A NAT gateway provides outbound internet connectivity for one or more subnets in your virtual network. For more information, see Designing virtual networks with NAT gateway resources.

  • Add user-defined routes (UDRs) to the subnet that contains the compute resource. Establish a UDR for each IP address that's used by the Azure Batch service in the region where your resources exist. These UDRs enable the Batch service to communicate with compute nodes for task scheduling. Also add the IP address for the Azure Machine Learning service, because it is required for access to compute instances. When adding the IP for the Azure Machine Learning service, you must add the IP for both the primary and secondary Azure regions. The primary region is the one where your workspace is located.

    To find the secondary region, see Ensure business continuity & disaster recovery using Azure Paired Regions. For example, if your Azure Machine Learning service is in East US 2, the secondary region is Central US.

    To get a list of IP addresses of the Batch service and Azure Machine Learning service, use one of the following methods:

    • Download the Azure IP Ranges and Service Tags and search the file for BatchNodeManagement.<region> and AzureMachineLearning.<region>, where <region> is your Azure region.

    • Use the Azure CLI to download the information. The following example downloads the IP address information and filters it to the East US 2 region (primary) and the Central US region (secondary):

      # Get Batch service IPs for the primary region
      az network list-service-tags -l "East US 2" --query "values[?starts_with(id, 'Batch')] | [?properties.region=='eastus2']"
      # Get Azure Machine Learning IPs for the primary region
      az network list-service-tags -l "East US 2" --query "values[?starts_with(id, 'AzureMachineLearning')] | [?properties.region=='eastus2']"
      # Get Azure Machine Learning IPs for the secondary region
      az network list-service-tags -l "Central US" --query "values[?starts_with(id, 'AzureMachineLearning')] | [?properties.region=='centralus']"
      

      Tip

      If you are using the US-Virginia, US-Arizona, or China-East-2 regions, these commands return no IP addresses. Instead, use one of the following links to download a list of IP addresses:

    When you add the UDRs, define the route for each related Batch IP address prefix and set Next hop type to Internet. The following image shows an example of this UDR in the Azure portal:

    Example of a UDR for an address prefix

    Important

    The IP addresses may change over time.

    In addition to any UDRs that you define, outbound traffic to Azure Storage must be allowed through your on-premises network appliance. Specifically, the URLs for this traffic are in the following forms: <account>.table.core.windows.net, <account>.queue.core.windows.net, and <account>.blob.core.windows.net.

    For more information, see Create an Azure Batch pool in a virtual network.
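The lookup and route definitions described in this section can be sketched in a few lines of Python. The example below is a minimal sketch, not a definitive implementation. It assumes the downloaded Azure IP Ranges and Service Tags file has a top-level "values" list whose entries carry a "name" and a "properties.addressPrefixes" list. The sample data uses made-up documentation address ranges, and the helper names and route dictionary keys are hypothetical. Because the IP addresses may change over time, you would rerun a script like this against a fresh download.

```python
import json

# Made-up excerpt in the shape of the Azure IP Ranges and Service Tags file.
# The address prefixes are documentation ranges, not real Azure addresses.
SAMPLE_SERVICE_TAGS = json.loads("""
{
  "values": [
    {"name": "BatchNodeManagement.EastUS2",
     "properties": {"region": "eastus2", "addressPrefixes": ["192.0.2.0/27", "192.0.2.64/27"]}},
    {"name": "AzureMachineLearning.EastUS2",
     "properties": {"region": "eastus2", "addressPrefixes": ["198.51.100.0/28"]}},
    {"name": "AzureMachineLearning.CentralUS",
     "properties": {"region": "centralus", "addressPrefixes": ["203.0.113.0/28"]}},
    {"name": "Storage.EastUS2",
     "properties": {"region": "eastus2", "addressPrefixes": ["203.0.113.128/25"]}}
  ]
}
""")


def address_prefixes(service_tags, service, region):
    """Collect the address prefixes for one <service>.<region> Service Tag."""
    wanted = f"{service}.{region}".lower()
    prefixes = []
    for entry in service_tags["values"]:
        if entry["name"].lower() == wanted:
            prefixes.extend(entry["properties"]["addressPrefixes"])
    return prefixes


def build_routes(service_tags, primary_region, secondary_region):
    """Build one route per prefix, with the next hop type set to Internet."""
    prefixes = (
        address_prefixes(service_tags, "BatchNodeManagement", primary_region)
        + address_prefixes(service_tags, "AzureMachineLearning", primary_region)
        + address_prefixes(service_tags, "AzureMachineLearning", secondary_region)
    )
    return [
        {"name": f"aml-route-{index}", "address_prefix": prefix, "next_hop_type": "Internet"}
        for index, prefix in enumerate(prefixes)
    ]


routes = build_routes(SAMPLE_SERVICE_TAGS, "eastus2", "centralus")
for route in routes:
    print(route)
```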

Create a compute cluster in a virtual network

To create a Machine Learning Compute cluster, use the following steps:

  1. Sign in to Azure Machine Learning studio, and then select your subscription and workspace.

  2. Select Compute on the left.

  3. Select Training clusters from the center, and then select +.

  4. In the New Training Cluster dialog, expand the Advanced settings section.

  5. To configure this compute resource to use a virtual network, perform the following actions in the Configure virtual network section:

    1. In the Resource group drop-down list, select the resource group that contains the virtual network.
    2. In the Virtual network drop-down list, select the virtual network that contains the subnet.
    3. In the Subnet drop-down list, select the subnet to use.

    The virtual network settings for Machine Learning Compute

You can also create a Machine Learning Compute cluster by using the Azure Machine Learning SDK. The following code creates a new Machine Learning Compute cluster in the default subnet of a virtual network named mynetwork:

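from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Load the workspace from a local config.json file (assumes one has been downloaded)
ws = Workspace.from_config()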

# The Azure virtual network name, subnet, and resource group
vnet_name = 'mynetwork'
subnet_name = 'default'
vnet_resourcegroup_name = 'mygroup'

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cpucluster")
except ComputeTargetException:
    print("Creating new cpucluster")

    # Specify the configuration for the new cluster
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                           min_nodes=0,
                                                           max_nodes=4,
                                                           vnet_resourcegroup_name=vnet_resourcegroup_name,
                                                           vnet_name=vnet_name,
                                                           subnet_name=subnet_name)

    # Create the cluster with the specified name and configuration
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

    # Wait for the cluster to be completed, show the output log
    cpu_cluster.wait_for_completion(show_output=True)

When the creation process finishes, you can train your model by using the cluster in an experiment. For more information, see Select and use a compute target for training.

Note

You may choose to use low-priority VMs to run some or all of your workloads. See how to create a low-priority VM.

Access data in a Compute Instance notebook

If you're using notebooks on an Azure Machine Learning compute instance, you must ensure that your notebook is running on a compute resource behind the same virtual network and subnet as your data.

You must configure your Compute Instance to be in the same virtual network during creation under Advanced settings > Configure virtual network. You cannot add an existing Compute Instance to a virtual network.

Azure Databricks

To use Azure Databricks in a virtual network with your workspace, the following requirements must be met:

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Databricks cluster.
  • In addition to the databricks-private and databricks-public subnets used by Azure Databricks, the default subnet created for the virtual network is also required.

For specific information on using Azure Databricks with a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Virtual machine or HDInsight cluster

Important

Azure Machine Learning supports only virtual machines that are running Ubuntu.

In this section you learn how to use a virtual machine or Azure HDInsight cluster in a virtual network with your workspace.

Create the VM or HDInsight cluster

Create a VM or HDInsight cluster by using the Azure portal or the Azure CLI, and put the cluster in an Azure virtual network. For more information, see the following articles:

Configure network ports

To allow Azure Machine Learning to communicate with the SSH port on the VM or cluster, configure a source entry for the network security group. The SSH port is usually port 22. To allow traffic from this source, do the following actions:

  1. In the Source drop-down list, select Service Tag.

  2. In the Source service tag drop-down list, select AzureMachineLearning.

    Inbound rules for doing experimentation on a VM or HDInsight cluster within a virtual network

  3. In the Source port ranges drop-down list, select *.

  4. In the Destination drop-down list, select Any.

  5. In the Destination port ranges drop-down list, select 22.

  6. Under Protocol, select Any.

  7. Under Action, select Allow.
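The portal settings in the steps above amount to a single inbound rule. The following is a minimal sketch that records the rule as plain data and checks that a rule permits SSH from Azure Machine Learning. The dictionary keys mirror the portal field names and, like the function names, are hypothetical rather than part of any Azure SDK.

```python
# The inbound rule from the steps above, expressed as plain data.
ssh_inbound_rule = {
    "source": "Service Tag",
    "source_service_tag": "AzureMachineLearning",
    "source_port_ranges": "*",
    "destination": "Any",
    "destination_port_ranges": "22",
    "protocol": "Any",
    "action": "Allow",
}


def port_in_ranges(port, ranges):
    """Return True if port matches a comma-separated list such as '22', '20-25', or '*'."""
    for part in ranges.split(","):
        part = part.strip()
        if part == "*":
            return True
        low, _, high = part.partition("-")
        if int(low) <= port <= int(high or low):
            return True
    return False


def allows_aml_ssh(rule, ssh_port=22):
    """Check that the rule allows Azure Machine Learning to reach the SSH port."""
    return (
        rule["action"] == "Allow"
        and rule["source_service_tag"] == "AzureMachineLearning"
        and rule["protocol"] in ("Any", "TCP")
        and port_in_ranges(ssh_port, rule["destination_port_ranges"])
    )


print(allows_aml_ssh(ssh_inbound_rule))
```

If your VM or cluster listens for SSH on a different port, pass that port as ssh_port and adjust the destination port range in the rule to match.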

Keep the default outbound rules for the network security group. For more information, see the default security rules in Security groups.

If you don't want to use the default outbound rules and you do want to limit the outbound access of your virtual network, see the Limit outbound connectivity from the virtual network section.

Attach the VM or HDInsight cluster

Attach the VM or HDInsight cluster to your Azure Machine Learning workspace. For more information, see Set up compute targets for model training.

Next steps

This article is part three in a five-part virtual network series. See the rest of the articles to learn how to secure a virtual network: