Plan and manage costs for Azure Machine Learning
This article describes how to plan and manage costs for Azure Machine Learning. First, you use the Azure pricing calculator to help plan for costs before you add any resources. Next, as you add the Azure resources, review the estimated costs. Finally, use cost-saving tips as you train your model with managed Azure Machine Learning compute clusters.
After you've started using Azure Machine Learning resources, use the cost management features to set budgets and monitor costs. Also review forecasted costs and spending trends to identify areas where you might want to act.
Keep in mind that the costs for Azure Machine Learning are only a portion of the monthly costs in your Azure bill. If you use other Azure services, you're billed for all the Azure services and resources used in your Azure subscription, including third-party services. This article explains how to plan for and manage costs for Azure Machine Learning. After you're familiar with managing costs for Azure Machine Learning, apply similar methods to manage costs for all the Azure services used in your subscription.
When you train your machine learning models, use managed Azure Machine Learning compute clusters to take advantage of more cost-saving tips:
- Configure your training clusters for autoscaling
- Set quotas on your subscription and workspaces
- Set termination policies on your training run
- Use low-priority virtual machines (VMs)
- Use an Azure Reserved VM Instance
Cost analysis supports different kinds of Azure account types. To view the full list of supported account types, see Understand Cost Management data. To view cost data, you need at least read access for your Azure account.
For information about assigning access to Azure Cost Management data, see Assign access to data.
Estimate costs before using Azure Machine Learning
Use the Azure pricing calculator to estimate costs before you create the resources in an Azure Machine Learning account. On the left, select AI + Machine Learning, then select Azure Machine Learning to begin.
The following screenshot shows the cost estimation by using the calculator:
As you add new resources to your workspace, return to this calculator and add the same resource here to update your cost estimates.
For more information, see Azure Machine Learning pricing.
Understand the full billing model for Azure Machine Learning
Azure Machine Learning runs on Azure infrastructure that accrues costs along with Azure Machine Learning when you deploy the new resource. It's important to understand that additional infrastructure might accrue cost. You need to manage that cost when you make changes to deployed resources.
Costs that typically accrue with Azure Machine Learning
When you create resources for an Azure Machine Learning workspace, resources for other Azure services are also created. They are:
- Azure Container Registry Basic account
- Azure Block Blob Storage (general purpose v1)
- Key Vault
- Application Insights
Costs might accrue after resource deletion
When you delete an Azure Machine Learning workspace in the Azure portal or with Azure CLI, the following resources continue to exist. They continue to accrue costs until you delete them.
- Azure Container Registry
- Azure Block Blob Storage
- Key Vault
- Application Insights
To delete the workspace along with these dependent resources, use the SDK:
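A minimal sketch of that call, assuming the Python SDK v1 (the azureml-core package), whose Workspace.delete method accepts a delete_dependent_resources argument. The helper name delete_workspace_and_dependencies is hypothetical, and ws is assumed to be a Workspace object you have already loaded (for example with Workspace.from_config()).

```python
def delete_workspace_and_dependencies(ws):
    """Delete a workspace together with its dependent resources.

    `ws` is assumed to be an azureml.core.Workspace (SDK v1) object,
    for example one returned by Workspace.from_config().
    """
    # delete_dependent_resources=True asks the service to also remove the
    # associated container registry, storage account, key vault, and
    # Application Insights resources instead of leaving them behind.
    ws.delete(delete_dependent_resources=True)
```

Deletion is permanent, so confirm that nothing else depends on the storage account or key vault before running it.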
If you create Azure Kubernetes Service (AKS) in your workspace, or if you attach any compute resources to your workspace, you must delete them separately in the Azure portal.
Using Azure Prepayment credit with Azure Machine Learning
You can pay for Azure Machine Learning charges with your Azure Prepayment (previously called monetary commitment) credit. However, you can't use Azure Prepayment credit to pay for third-party products and services, including those from the Azure Marketplace.
You can create budgets to manage costs and create alerts that automatically notify stakeholders of spending anomalies and overspending risks. Alerts are based on spending compared to budget and cost thresholds. Budgets and alerts are created for Azure subscriptions and resource groups, so they're useful as part of an overall cost monitoring strategy.
Budgets can be created with filters for specific resources or services in Azure if you want more granularity in your monitoring. Filters help ensure that you don't accidentally create new resources that cost you additional money. For more about the filter options when you create a budget, see Group and filter options.
Export cost data
You can also export your cost data to a storage account. This is helpful when you or others need to do additional data analysis for costs. For example, a finance team can analyze the data using Excel or Power BI. You can export your costs on a daily, weekly, or monthly schedule and set a custom date range. Exporting cost data is the recommended way to retrieve cost datasets.
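Exported data can also be summarized with a short script. The following is a rough sketch that totals cost per service from CSV text. The column names ServiceName and Cost and the sample rows are assumptions made for illustration; the columns in a real export depend on your account type, so adjust the names to match your file.

```python
import csv
import io
from collections import defaultdict


def total_cost_by_service(csv_text, service_column="ServiceName", cost_column="Cost"):
    """Sum the cost column per service from exported cost data in CSV form.

    The default column names are hypothetical placeholders.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[service_column]] += float(row[cost_column])
    return dict(totals)


# Made-up sample rows, for illustration only.
sample = """ServiceName,Cost
Azure Machine Learning,12.50
Storage,1.25
Azure Machine Learning,7.50
"""

print(total_cost_by_service(sample))
```

For a file on disk, read its contents and pass the text to the same function.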
Other ways to manage and reduce costs for Azure Machine Learning
Use these tips for containing costs on your machine learning compute resources.
Use Azure Machine Learning compute cluster (AmlCompute)
With constantly changing data, you need fast and streamlined model training and retraining to maintain accurate models. However, continuous training comes at a cost, especially for deep learning models on GPUs.
Azure Machine Learning users can use the managed Azure Machine Learning compute cluster, also called AmlCompute. AmlCompute supports a variety of GPU and CPU options. AmlCompute is hosted internally on behalf of your subscription by Azure Machine Learning. It provides the same enterprise-grade security, compliance, and governance at Azure IaaS cloud scale.
Because these compute pools are inside of Azure's IaaS infrastructure, you can deploy, scale, and manage your training with the same security and compliance requirements as the rest of your infrastructure. These deployments occur in your subscription and obey your governance rules. Learn more about Azure Machine Learning compute.
Configure training clusters for autoscaling
Autoscaling clusters based on the requirements of your workload helps reduce your costs so you only use what you need.
AmlCompute clusters are designed to scale dynamically based on your workload. The cluster can be scaled up to the maximum number of nodes you configure. As each run completes, the cluster will release nodes and scale to your configured minimum node count.
To avoid charges when no jobs are running, set the minimum nodes to 0. This setting allows Azure Machine Learning to de-allocate the nodes when they aren't in use. Any value larger than 0 will keep that number of nodes running, even if they are not in use.
You can also configure the amount of time the node is idle before scale down. By default, idle time before scale down is set to 120 seconds.
- If you perform less iterative experimentation, reduce this time to save costs.
- If you perform highly iterative dev/test experimentation, you might need to increase the time so you aren't paying for constant scaling up and down after each change to your training script or environment.
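For example, the following Azure CLI command creates a cluster that scales between 0 and 5 nodes and waits 300 seconds before scaling down idle nodes:
az ml computetarget create amlcompute --name testcluster --vm-size Standard_NC6 --min-nodes 0 --max-nodes 5 --idle-seconds-before-scaledown 300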
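To see why a minimum of 0 nodes matters, the following sketch estimates what idle nodes cost over a month. The hourly rate is a made-up placeholder rather than a real Azure price, and the 730 hours per month figure is an assumed average; substitute the rate for your VM size from the pricing calculator.

```python
def monthly_idle_cost(min_nodes, hourly_rate, hours_per_month=730):
    """Cost of keeping `min_nodes` allocated for a whole month with no jobs running."""
    return min_nodes * hourly_rate * hours_per_month


HYPOTHETICAL_RATE = 1.0  # made-up price per node-hour, not a real Azure price

print(monthly_idle_cost(0, HYPOTHETICAL_RATE))  # 0.0
print(monthly_idle_cost(2, HYPOTHETICAL_RATE))  # 1460.0
```

With a minimum of 0 nodes the idle cost is zero, while every node kept in the minimum is billed for the full month whether or not it runs a job.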
Set quotas on resources
AmlCompute comes with a quota (or limit) configuration. This quota is by VM family (for example, Dv2 series, NCv3 series) and varies by region for each subscription. Subscriptions start with small defaults to get you going, but use this setting to control the amount of AmlCompute resources available to be spun up in your subscription.
Also configure workspace level quota by VM family, for each workspace within a subscription. Doing so allows you to have more granular control on the costs that each workspace might potentially incur and restrict certain VM families.
To set quotas at the workspace level, start in the Azure portal. Select any workspace in your subscription, and select Usages + quotas in the left pane. Then select the Configure quotas tab to view the quotas. You need privileges at the subscription scope to set the quota, since it's a setting that affects multiple workspaces.
Set run autotermination policies
In some cases, you should configure your training runs to limit their duration or terminate them early, for example, when you use Azure Machine Learning's built-in hyperparameter tuning or automated machine learning.
Here are a few options that you have:
- Define a parameter called max_run_duration_seconds in your RunConfiguration to control the maximum duration a run can extend to on the compute you choose (either local or remote cloud compute).
- For hyperparameter tuning, define an early termination policy from a Bandit policy, a Median stopping policy, or a Truncation selection policy. To further control hyperparameter sweeps, use parameters such as
- For automated machine learning, set similar termination policies using the enable_early_stopping flag. Also use properties such as experiment_timeout_minutes to control the maximum duration of a run or of the entire experiment.
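To illustrate how an early termination policy saves compute, here is a simplified sketch of the decision rule behind a Bandit policy: a run is stopped when its metric falls outside a slack factor of the best run so far. The function name is hypothetical, the sketch assumes a positive metric that is being maximized, and it is an illustration of the idea rather than the service's implementation.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes the primary metric is positive and is being maximized.
    """
    # Runs scoring below best / (1 + slack_factor) are considered unpromising.
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# With a best accuracy of 0.9 and a slack factor of 0.1,
# the cutoff is 0.9 / 1.1, roughly 0.818.
print(bandit_should_terminate(0.80, 0.9, 0.1))  # True: stop this run
print(bandit_should_terminate(0.85, 0.9, 0.1))  # False: keep it running
```

A smaller slack factor stops more runs earlier and saves more compute, at the risk of ending a run that would have improved later.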
Use low-priority VMs
Azure allows you to use excess unutilized capacity as Low-Priority VMs across virtual machine scale sets, Batch, and the Machine Learning service. These allocations are pre-emptible but come at a reduced price compared to dedicated VMs. In general, we recommend using Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).
Low-Priority VMs have a single quota separate from the dedicated quota value, which is by VM family. Learn more about AmlCompute quotas.
Low-Priority VMs don't work for compute instances, since they need to support interactive notebook experiences.
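Recovering from a preemption through restarts depends on the training script saving and reloading its progress. The following toy sketch shows the pattern with a JSON file as the checkpoint. The function name, the stop_after parameter (which simulates a preemption), and the checkpoint format are all assumptions for illustration; a real script would save model and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile


def train_with_checkpoints(total_epochs, checkpoint_path, stop_after=None):
    """Run a toy training loop that resumes from `checkpoint_path` if it exists.

    `stop_after` simulates a preemption after that many epochs in this call.
    Returns the number of epochs completed so far.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["epoch"]  # resume where the last run stopped

    done_this_call = 0
    for epoch in range(start, total_epochs):
        if stop_after is not None and done_this_call >= stop_after:
            return epoch  # simulated preemption
        # ... a real training step for this epoch would go here ...
        with open(checkpoint_path, "w") as f:
            json.dump({"epoch": epoch + 1}, f)  # record progress after each epoch
        done_this_call += 1
    return total_epochs


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = train_with_checkpoints(5, path, stop_after=2)  # interrupted after 2 epochs
second = train_with_checkpoints(5, path)               # restart resumes at epoch 2
print(first, second)  # 2 5
```

Because the second call starts from the saved epoch, the work done before the interruption is not repeated or billed twice.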
Use reserved instances
Another way to save money on compute resources is Azure Reserved VM Instance. With this offering, you commit to one-year or three-year terms. The discounts range up to 72% off pay-as-you-go prices and are applied directly to your monthly Azure bill.
Azure Machine Learning Compute supports reserved instances inherently. If you purchase a one-year or three-year reserved instance, the discount is automatically applied to your Azure Machine Learning managed compute.
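As a rough worked example of the arithmetic, the sketch below applies a discount percentage to a pay-as-you-go amount. The function name and the input cost are hypothetical, and 72 is the upper bound stated above; the discount you actually receive depends on the VM size, region, and term.

```python
def reserved_cost(pay_as_you_go_cost, discount_percent):
    """Cost after applying a reserved-instance discount given in whole percent."""
    if not 0 <= discount_percent <= 72:
        raise ValueError("discount_percent must be between 0 and 72")
    return pay_as_you_go_cost * (100 - discount_percent) / 100


# A made-up monthly pay-as-you-go compute cost of 1000 at the maximum discount.
print(reserved_cost(1000, 72))  # 280.0
```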
- Learn how to optimize your cloud investment with Azure Cost Management.
- Learn more about managing costs with cost analysis.
- Learn about how to prevent unexpected costs.
- Take the Cost Management guided learning course.