What are compute targets in Azure Machine Learning?
A compute target is a designated compute resource or environment where you run your training script or host your service deployment. This location might be your local machine or a cloud-based compute resource. Using compute targets makes it easy for you to later change your compute environment without having to change your code.
In a typical model development lifecycle, you might:
- Start by developing and experimenting on a small amount of data. At this stage, use your local environment, such as a local computer or cloud-based virtual machine (VM), as your compute target.
- Scale up to larger data, or do distributed training by using one of these training compute targets.
- After your model is ready, deploy it to a web hosting environment or IoT device with one of these deployment compute targets.
The compute resources you use for your compute targets are attached to a workspace. Compute resources other than the local machine are shared by users of the workspace.
Training compute targets
Azure Machine Learning has varying support across different compute targets. A typical model development lifecycle starts with development or experimentation on a small amount of data. At this stage, use a local environment like your local computer or a cloud-based VM. As you scale up your training on larger datasets or perform distributed training, use Azure Machine Learning compute to create a single- or multi-node cluster that autoscales each time you submit a run. You can also attach your own compute resource, although support for different scenarios might vary.
Compute targets can be reused from one training job to the next. For example, after you attach a remote VM to your workspace, you can reuse it for multiple jobs. For machine learning pipelines, use the appropriate pipeline step for each compute target.
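For illustration, attaching an existing VM with the Python SDK (v1) might look like the following sketch. The address, credentials, and target name are placeholder assumptions, not values from this article:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, RemoteCompute

ws = Workspace.from_config()  # reads workspace details from a local config.json

# Describe how to reach the existing VM (placeholder address and credentials).
attach_config = RemoteCompute.attach_configuration(
    address="203.0.113.10",  # hypothetical public IP or FQDN of your VM
    ssh_port=22,
    username="azureuser",
    private_key_file="~/.ssh/id_rsa",  # or use password= for password auth
)

# Once attached, the target can be reused for multiple training jobs.
vm_target = ComputeTarget.attach(ws, name="my-remote-vm", attach_configuration=attach_config)
vm_target.wait_for_completion(show_output=True)
```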
You can use any of the following resources for a training compute target for most jobs. Not all resources can be used for automated machine learning, machine learning pipelines, or designer.
| Training targets | Automated machine learning | Machine learning pipelines | Azure Machine Learning designer |
| --- | --- | --- | --- |
| Local computer | Yes | | |
| Azure Machine Learning compute cluster | Yes | Yes | Yes |
| Azure Machine Learning compute instance | Yes (through SDK) | Yes | |
| Remote VM | Yes | Yes | |
| Azure Databricks | Yes (SDK local mode only) | Yes | |
| Azure Data Lake Analytics | | Yes | |
| Azure HDInsight | | Yes | |
| Azure Batch | | Yes | |
Learn more about how to submit a training run to a compute target.
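As a rough sketch of what that looks like with the Python SDK (v1), where the script, source directory, experiment name, and compute target name are placeholders:

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()

# Bind the training script to a compute target by name ("cpu-cluster" is hypothetical).
src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    compute_target="cpu-cluster",
)

# Submitting the same config against a different target requires no script changes.
run = Experiment(ws, "my-experiment").submit(src)
run.wait_for_completion(show_output=True)
```

Because the compute target is just a parameter of the run configuration, switching from a local run to a cluster run is a one-line change.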
Compute targets for inference
The following compute resources can be used to host your model deployment.
The compute target you use to host your model will affect the cost and availability of your deployed endpoint. Use this table to choose an appropriate compute target.
| Compute target | Used for | GPU support | FPGA support | Description |
| --- | --- | --- | --- | --- |
| Local web service | Testing/debugging | | | Use for limited testing and troubleshooting. Hardware acceleration depends on the libraries used in the local system. |
| Azure Kubernetes Service (AKS) | Real-time inference | Yes (web service deployment) | Yes | Use for high-scale production deployments. Provides fast response time and autoscaling of the deployed service. Cluster autoscaling isn't supported through the Azure Machine Learning SDK. To change the nodes in the AKS cluster, use the UI for your AKS cluster in the Azure portal. Supported in the designer. |
| Azure Container Instances | Testing or development | | | Use for low-scale CPU-based workloads that require less than 48 GB of RAM. Supported in the designer. |
| Azure Machine Learning compute clusters | Batch inference | Yes (machine learning pipeline) | | Run batch scoring on serverless compute. Supports normal and low-priority VMs. No support for real-time inference. |
Note
Although compute targets like local, Azure Machine Learning compute, and Azure Machine Learning compute clusters support GPU for training and experimentation, using GPU for inference when deployed as a web service is supported only on AKS.
Using a GPU for inference when scoring with a machine learning pipeline is supported only on Azure Machine Learning compute.
When choosing a cluster SKU, first scale up and then scale out. Start with a machine that has 150% of the RAM your model requires, profile the result, and find a machine that has the performance you need. For example, if your model requires 4 GB of RAM, start with a VM size that offers at least 6 GB. Once you've found a suitable size, increase the number of machines to fit your need for concurrent inference.
Note
- Container instances are suitable only for small models less than 1 GB in size.
- Use single-node AKS clusters for dev/test of larger models.
When performing inference, Azure Machine Learning creates a Docker container that hosts the model and associated resources needed to use it. This container is then used in one of the following deployment scenarios:
- As a web service that's used for real-time inference. Web service deployments use one of the following compute targets:
  - Local computer
  - Azure Machine Learning compute instance
  - Azure Container Instances
  - Azure Kubernetes Service
  - Azure Functions (preview). Deployment to Functions only relies on Azure Machine Learning to build the Docker container. From there, it's deployed by using Functions. For more information, see Deploy a machine learning model to Azure Functions (preview).
- As a batch inference endpoint that's used to periodically process batches of data. Batch inference uses Azure Machine Learning compute clusters.
- To an IoT device (preview). Deployment to an IoT device only relies on Azure Machine Learning to build the Docker container. From there, it's deployed by using Azure IoT Edge. For more information, see Deploy as an IoT Edge module (preview).
Learn where and how to deploy your model to a compute target.
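To make the web service path concrete, here's a minimal deployment sketch with the Python SDK (v1) targeting Azure Container Instances. The model name, scoring script, and curated environment are assumptions for illustration:

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# Scoring script and environment are placeholders; supply your own entry script.
env = Environment.get(ws, "AzureML-Minimal")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Container Instances suits low-scale testing; request 1 CPU core and 1 GB of RAM.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

model = Model(ws, name="my-model")  # a previously registered model (placeholder name)
service = Model.deploy(ws, "my-service", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```

Swapping the deployment configuration for an AKS one moves the same container to a production-grade target.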
Azure Machine Learning compute (managed)
A managed compute resource is created and managed by Azure Machine Learning. This compute is optimized for machine learning workloads. Azure Machine Learning compute clusters and compute instances are the only managed computes.
You can create Azure Machine Learning compute instances or compute clusters from:
- Azure Machine Learning studio.
- The Python SDK and CLI.
- The R SDK (preview).
- An Azure Resource Manager template. For an example template, see Create an Azure Machine Learning compute cluster.
- A machine learning extension for the Azure CLI.
When created, these compute resources are automatically part of your workspace, unlike other kinds of compute targets.
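For example, provisioning a compute cluster with the Python SDK (v1) might look like this sketch; the VM size, node counts, and cluster name are placeholder choices:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# A cluster that scales between 0 and 4 nodes; it autoscales to 0 when idle.
compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",  # any supported size; see "Supported VM series and sizes"
    min_nodes=0,
    max_nodes=4,
)

cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)
```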
| Capability | Compute cluster | Compute instance |
| --- | --- | --- |
| Single- or multi-node cluster | ✓ | |
| Autoscales each time you submit a run | ✓ | |
| Automatic cluster management and job scheduling | ✓ | ✓ |
| Support for both CPU and GPU resources | ✓ | ✓ |
Note
When a compute cluster is idle, it autoscales to 0 nodes, so you don't pay when it's not in use. A compute instance is always on and doesn't autoscale. You should stop the compute instance when you aren't using it to avoid extra cost.
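A minimal sketch of stopping an idle compute instance with the Python SDK (v1), where the instance name is a placeholder:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance

ws = Workspace.from_config()

# Retrieve an existing compute instance by name (placeholder).
instance = ComputeInstance(workspace=ws, name="my-instance")

# Stop it when you're done working to avoid paying for idle time.
instance.stop(wait_for_completion=True, show_output=True)
```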
Supported VM series and sizes
When you select a node size for a managed compute resource in Azure Machine Learning, you can choose from among select VM sizes available in Azure. Azure offers a range of sizes for Linux and Windows for different workloads. To learn more, see VM types and sizes.
There are a few exceptions and limitations to choosing a VM size:
- Some VM series aren't supported in Azure Machine Learning.
- Some VM series are restricted. To use a restricted series, contact support and request a quota increase for the series. For information on how to contact support, see Azure support options.
See the following table to learn more about supported series and restrictions.
| Supported VM series | Restrictions |
| --- | --- |
| D | None. |
| Dv2 | None. |
| Dv3 | None. |
| DSv2 | None. |
| DSv3 | None. |
| FSv2 | None. |
| HBv2 | Requires approval. |
| HCS | Requires approval. |
| M | Requires approval. |
| NC | None. |
| NCsv2 | Requires approval. |
| NCsv3 | Requires approval. |
| NDs | Requires approval. |
| NDv2 | Requires approval. |
| NV | None. |
| NVv3 | Requires approval. |
While Azure Machine Learning supports these VM series, they might not be available in all Azure regions. To check whether VM series are available, see Products available by region.
Note
Azure Machine Learning doesn't support all VM sizes that Azure Compute supports. You can list the sizes that are actually available programmatically, for example with the Python SDK, as sketched below.
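A minimal sketch with the Python SDK (v1), assuming a workspace config file is present locally:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute

ws = Workspace.from_config()

# List the VM sizes Azure Machine Learning supports in the workspace's region.
for size in AmlCompute.supported_vmsizes(workspace=ws):
    print(size)  # each entry describes one VM size (name, cores, memory, and so on)
```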
Compute isolation
Azure Machine Learning compute offers VM sizes that are isolated to a specific hardware type and dedicated to a single customer. Isolated VM sizes are best suited for workloads that require a high degree of isolation from other customers' workloads, for example to meet compliance and regulatory requirements. Using an isolated size guarantees that your VM is the only one running on that specific server instance.
The current isolated VM offerings include:
- Standard_M128ms
- Standard_F72s_v2
- Standard_NC24s_v3
- Standard_NC24rs_v3*
*RDMA capable
To learn more about isolation, see Isolation in the Azure public cloud.
Unmanaged compute
An unmanaged compute target is not managed by Azure Machine Learning. You create this type of compute target outside Azure Machine Learning and then attach it to your workspace. Unmanaged compute resources can require additional steps for you to maintain or to improve performance for machine learning workloads.
Next steps
Learn how to: