Monitoring Azure Machine Learning

This article describes the monitoring data generated by Azure Machine Learning. It also describes how you can use Azure Monitor to analyze that data and define alerts.

Tip

The information in this document is primarily for administrators, because it describes monitoring for the Azure Machine Learning service. If you are a data scientist or developer and want to monitor information specific to your model training runs, see the following documents:

Azure Monitor

Azure Machine Learning logs monitoring data using Azure Monitor, which is a full-stack monitoring service in Azure. Azure Monitor provides a complete set of features to monitor your Azure resources. It can also monitor resources in other clouds and on-premises.

Start with the article Azure Monitor overview, which provides an overview of the monitoring capabilities. The following sections build on this information by providing specifics of using Azure Monitor with Azure Machine Learning.

To understand costs associated with Azure Monitor, see Usage and estimated costs. To understand the time it takes for your data to appear in Azure Monitor, see Log data ingestion time.

Monitoring data from Azure Machine Learning

Azure Machine Learning collects the same kinds of monitoring data as other Azure resources, which are described in Monitoring data from Azure resources. See Azure Machine Learning monitoring data reference for a detailed reference of the logs and metrics created by Azure Machine Learning.

Analyzing metric data

You can analyze metrics for Azure Machine Learning by opening Metrics from the Azure Monitor menu. See Getting started with Azure Metrics Explorer for details on using this tool.

All metrics for Azure Machine Learning are in the namespace Machine Learning Service Workspace.

Metrics Explorer with Machine Learning Service Workspace selected

Filtering and splitting

For metrics that support dimensions, you can apply filters using a dimension value. For example, you can filter Active Cores to a Cluster Name of cpu-cluster.

You can also split a metric by dimension to visualize how different segments of the metric compare with each other. For example, you can split by Pipeline Step Type to see a count of the types of steps used in the pipeline.

For more information on filtering and splitting, see Advanced features of Azure Monitor.
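
To make the difference between filtering and splitting concrete, here is a minimal Python sketch that works on made-up data. The sample values, the dictionary keys, and both helper functions are hypothetical and exist only for illustration. Metrics Explorer performs these operations for you in the portal.

```python
from collections import defaultdict

# Hypothetical samples of the Active Cores metric, each tagged with a
# "Cluster Name" dimension value. The numbers are invented.
samples = [
    {"metric": "Active Cores", "cluster_name": "cpu-cluster", "value": 4},
    {"metric": "Active Cores", "cluster_name": "gpu-cluster", "value": 12},
    {"metric": "Active Cores", "cluster_name": "cpu-cluster", "value": 8},
]


def filter_by_dimension(points, dimension, wanted):
    """Filtering: keep only the points whose dimension has the wanted value."""
    return [p for p in points if p.get(dimension) == wanted]


def split_by_dimension(points, dimension):
    """Splitting: produce one total per distinct value of the dimension."""
    totals = defaultdict(int)
    for p in points:
        totals[p[dimension]] += p["value"]
    return dict(totals)


cpu_only = filter_by_dimension(samples, "cluster_name", "cpu-cluster")
per_cluster = split_by_dimension(samples, "cluster_name")

print(len(cpu_only), "points for cpu-cluster")
print(per_cluster)
```

Filtering narrows the chart to one segment, while splitting keeps every segment and draws each one separately.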

Alerts

You can access alerts for Azure Machine Learning by opening Alerts from the Azure Monitor menu. See Create, view, and manage metric alerts using Azure Monitor for details on creating alerts.

The following table lists common and recommended metric alert rules for Azure Machine Learning:

Alert type | Condition | Description
Model Deploy Failed | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When one or more model deployments have failed
Quota Utilization Percentage | Aggregation type: Average, Operator: Greater than, Threshold value: 90 | When the quota utilization percentage is greater than 90%
Unusable Nodes | Aggregation type: Total, Operator: Greater than, Threshold value: 0 | When there are one or more unusable nodes
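
The following Python sketch shows one way to read the three parts of each condition (aggregation type, operator, threshold). It is not how Azure Monitor evaluates alerts internally. The `RULES` mapping and the `rule_fires` helper are hypothetical names introduced for illustration, and the input values are invented.

```python
import operator
from statistics import mean

# How each aggregation type collapses the values in an evaluation window.
AGGREGATIONS = {"Total": sum, "Average": mean}

# The only operator used by the recommended rules.
OPERATORS = {"Greater than": operator.gt}

# The recommended rules: (aggregation type, operator, threshold).
RULES = {
    "Model Deploy Failed": ("Total", "Greater than", 0),
    "Quota Utilization Percentage": ("Average", "Greater than", 90),
    "Unusable Nodes": ("Total", "Greater than", 0),
}


def rule_fires(rule_name, values):
    """Return True if the aggregated values violate the rule's threshold."""
    aggregation, op, threshold = RULES[rule_name]
    aggregated = AGGREGATIONS[aggregation](values)
    return OPERATORS[op](aggregated, threshold)


# One failed deployment anywhere in the window is enough to fire.
print(rule_fires("Model Deploy Failed", [0, 0, 1]))
# An average of exactly 90 is not greater than 90, so this does not fire.
print(rule_fires("Quota Utilization Percentage", [85, 95]))
```

Because the quota rule uses Average rather than Total, a single spike above 90% does not fire the alert unless the average over the window also exceeds the threshold.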

Configuration

Important

Metrics for Azure Machine Learning do not need to be configured. They are collected automatically and are available in Metrics Explorer for monitoring and alerting.

You can add a diagnostic setting to configure the following functionality:

  • Archive log and metrics information to an Azure storage account.
  • Stream log and metrics information to an Azure Event Hub.
  • Send log and metrics information to Azure Monitor Log Analytics.

Enabling these settings requires additional Azure services (storage account, event hub, or Log Analytics), which may increase your cost. To calculate an estimated cost, visit the Azure pricing calculator.

For more information on creating a diagnostic setting, see Create diagnostic setting to collect platform logs and metrics in Azure.

You can configure the following logs for Azure Machine Learning:

Category | Description
AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent | Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute.
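
As a rough illustration of what a diagnostic setting contains, the sketch below assembles a settings body in Python. The property names (`workspaceId`, `storageAccountId`, `eventHubAuthorizationRuleId`, `logs`, `metrics`, and the `AllMetrics` category) are assumed to follow the Azure Monitor diagnostic settings resource format, so verify them against the current reference before use. The `build_diagnostic_setting` helper and the resource ID are hypothetical placeholders.

```python
import json

# Log categories that can be configured for Azure Machine Learning.
LOG_CATEGORIES = [
    "AmlComputeClusterEvent",
    "AmlComputeClusterNodeEvent",
    "AmlComputeJobEvent",
]


def build_diagnostic_setting(log_categories, workspace_id=None,
                             storage_account_id=None,
                             event_hub_rule_id=None,
                             include_metrics=True):
    """Build a diagnostic setting body for one or more destinations."""
    if not (workspace_id or storage_account_id or event_hub_rule_id):
        raise ValueError("At least one destination is required.")

    properties = {
        "logs": [{"category": c, "enabled": True} for c in log_categories],
        # Assumed catch-all metrics category.
        "metrics": [{"category": "AllMetrics", "enabled": include_metrics}],
    }
    # Add only the destinations that were supplied.
    if workspace_id:
        properties["workspaceId"] = workspace_id
    if storage_account_id:
        properties["storageAccountId"] = storage_account_id
    if event_hub_rule_id:
        properties["eventHubAuthorizationRuleId"] = event_hub_rule_id
    return {"properties": properties}


# Placeholder resource ID; substitute your own Log Analytics workspace.
setting = build_diagnostic_setting(
    LOG_CATEGORIES,
    workspace_id="/subscriptions/<subscription-id>/resourceGroups/<group>"
                 "/providers/Microsoft.OperationalInsights/workspaces/<name>",
)
print(json.dumps(setting, indent=2))
```

Each destination is optional, but a setting with no destination has nowhere to send data, which is why the helper rejects that case.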

Note

When you enable metrics in a diagnostic setting, dimension information is not currently included in the data sent to a storage account, event hub, or Log Analytics.

Analyzing log data

Using Azure Monitor Log Analytics requires you to create a diagnostic configuration and enable Send information to Log Analytics. For more information, see the Configuration section.

Data in Azure Monitor Logs is stored in tables, with each table having its own set of unique properties. Azure Machine Learning stores data in the following tables:

Table | Description
AmlComputeClusterEvent | Events from Azure Machine Learning compute clusters.
AmlComputeClusterNodeEvent | Events from nodes within an Azure Machine Learning compute cluster.
AmlComputeJobEvent | Events from jobs running on Azure Machine Learning compute.

Important

When you select Logs from the Azure Machine Learning menu, Log Analytics is opened with the query scope set to the current workspace. This means that log queries will only include data from that resource. If you want to run a query that includes data from other workspaces or data from other Azure services, select Logs from the Azure Monitor menu. See Log query scope and time range in Azure Monitor Log Analytics for details.

For a detailed reference of the logs and metrics, see Azure Machine Learning monitoring data reference.

Sample queries

The following queries can help you monitor your Azure Machine Learning resources:

  • Get failed jobs in the last five days:

    AmlComputeJobEvent
    | where TimeGenerated > ago(5d) and EventType == "JobFailed"
    | project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
    
  • Get records for a specific job name:

    AmlComputeJobEvent
    | where JobName == "automl_a9940991-dedb-4262-9763-2fd08b79d8fb_setup"
    | project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType
    
  • Get cluster events in the last five days for clusters where the VM size is Standard_D1_V2:

    AmlComputeClusterEvent
    | where TimeGenerated > ago(5d) and VmSize == "STANDARD_D1_V2"
    | project ClusterName, InitialNodeCount, MaximumNodeCount, QuotaAllocated, QuotaUtilized
    
  • Get nodes allocated in the last eight days:

    AmlComputeClusterNodeEvent
    | where TimeGenerated > ago(8d) and NodeAllocationTime > ago(8d)
    | distinct NodeId
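
If you reuse these queries with different time windows, a small helper can generate the query text. The Python sketch below parameterizes the failed-jobs query shown above. The `failed_jobs_query` function is a hypothetical helper, not part of any Azure SDK, and it only builds the query string. You can paste the result into Log Analytics or pass it to whichever query client you use.

```python
def failed_jobs_query(days):
    """Return the failed-jobs query text for a look-back window in days."""
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return "\n".join([
        "AmlComputeJobEvent",
        f'| where TimeGenerated > ago({days}d) and EventType == "JobFailed"',
        "| project TimeGenerated, ClusterId, EventType, ExecutionState, ToolType",
    ])


query = failed_jobs_query(5)
print(query)
```

Validating the window before formatting keeps malformed values out of the `ago()` expression.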
    

Next steps