Monitor virtual machines with Azure Monitor: Alerts

This article is part of the scenario Monitor virtual machines and their workloads in Azure Monitor. It provides guidance on creating alert rules for your virtual machines and their guest operating systems. Alerts in Azure Monitor proactively notify you of interesting data and patterns in your monitoring data. There are no preconfigured alert rules for virtual machines, but you can create your own based on data collected by VM insights.

Note

This scenario describes how to implement complete monitoring of your Azure and hybrid virtual machine environment. To get started monitoring your first Azure virtual machine, see Monitor Azure virtual machines, Tutorial: Create a metric alert for an Azure resource, or Tutorial: Create alert when Azure virtual machine is unavailable.

Important

Most alert rules have a cost that's dependent on the type of rule, how many dimensions it includes, and how frequently it's run. Before you create any alert rules, refer to Alert rules in Azure Monitor pricing.

Choose the alert type

The most common types of alert rules in Azure Monitor are metric alerts and log query alerts. The type of alert rule that you create for a particular scenario depends on where the data is located that you're alerting on. You might have cases where data for a particular alerting scenario is available in both Metrics and Logs, and you'll need to determine which rule type to use. You might also have flexibility in how you collect certain data and let your decision of alert rule type drive your decision for data collection method.

Typically, the best strategy is to use metric alerts instead of log alerts when possible because they're more responsive and stateful. To use metric alerts, the data you're alerting on must be available in Metrics. VM insights currently sends all of its data to Logs, so you must install the Azure Monitor agent to use metric alerts with data from the guest operating system. Use log query alerts with metric data when it's unavailable in Metrics or if you require logic beyond the relatively simple logic for a metric alert rule.

Metric alerts

Metric alert rules are useful for alerting when a particular metric exceeds a threshold. An example is when the CPU of a machine is running high. The target of a metric alert rule can be a specific machine, a resource group, or a subscription. When the target is a resource group or a subscription, a single rule applies to a group of machines.

Metric rules for virtual machines can use the following data:

  • Host metrics for Azure virtual machines, which are collected automatically.
  • Metrics that are collected by the Azure Monitor agent from the guest operating system.

Note

When VM insights supports the Azure Monitor agent, which is currently in public preview, it sends performance data from the guest operating system to Metrics so that you can use metric alerts.

Log alerts

Log alerts can measure two different things, each of which suits different virtual machine monitoring scenarios:

  • Result count: Counts the number of rows returned by the query. Use it to work with events such as Windows event logs, syslog, and application exceptions.
  • Calculation of a value: Makes a calculation based on a numeric column, such as CPU percentage, and can include any number of resources.
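
As an illustration of the result count measurement, the following is a minimal sketch of a query that returns one row per error event. It assumes that Windows event log collection is configured so that the Event table is populated in the workspace; the choice of the System log and the Error level is only an example.

```kusto
// Sketch: each returned row is one matching event.
// A result-count rule would alert on the number of rows returned.
// Assumption: Windows System event log collection is enabled for the workspace.
Event
| where EventLog == "System" and EventLevelName == "Error"
| project TimeGenerated, Computer, _ResourceId, EventID, RenderedDescription
```

A calculation-of-a-value rule instead aggregates a numeric column, as in the InsightsMetrics queries later in this article.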

Targeting resources and dimensions

You can monitor the values of multiple instances with one rule by using dimensions. For example, use dimensions to alert when CPU usage exceeds 80% on any of the instances running your website or app.

To create resource-centric alerts at scale for a subscription or resource group, you can Split by dimensions. When you want to monitor the same condition on multiple Azure resources, splitting by dimensions splits the alerts into separate alerts by grouping unique combinations using numerical or string columns. Splitting on the Azure resource ID column makes the specified resource the alert target.

You might also decide not to split when the condition applies to multiple resources in the scope taken together. For example, you might want to alert if at least five machines in the resource group scope have CPU usage over 80%.
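
The unsplit case can be sketched with a query that counts machines instead of returning one time series per machine. This is an illustrative sketch, not a prescribed rule: the column names AvgCpu and MachinesOverThreshold are hypothetical, and the 15-minute bin and the 80 percent threshold are taken from the examples in this article.

```kusto
// Sketch: count how many machines average more than 80% CPU in each 15-minute bin.
InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AvgCpu = avg(Val) by bin(TimeGenerated, 15m), _ResourceId
| where AvgCpu > 80
// After the first summarize there is one row per machine per bin,
// so count() gives the number of machines over the threshold.
| summarize MachinesOverThreshold = count() by TimeGenerated
```

With a query like this, the rule would measure MachinesOverThreshold against a threshold of 5 and would not split by dimensions. Bins in which no machine exceeds 80 percent return no rows.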

Screenshot of new log alert rule with split by dimensions.

You might want to see a list of the alerts with the affected computers. You can use a custom workbook that uses a custom Resource Graph query to provide this view. Use the following query to display alerts, and set the workbook's data source to Azure Resource Graph.

alertsmanagementresources
| extend dimension = properties.context.context.condition.allOf
| mv-expand dimension
| extend dimension = dimension.dimensions
| mv-expand dimension
| extend Computer = dimension.value
| extend AlertStatus = properties.essentials.alertState
| summarize count() by Alert=name, tostring(AlertStatus), tostring(Computer)
| project Alert, AlertStatus, Computer

Common alert rules

The following section lists common alert rules for virtual machines in Azure Monitor. Details for metric alerts and log metric measurement alerts are provided for each. For guidance on which type of alert to use, see Choose the alert type.

If you're unfamiliar with the process for creating alert rules in Azure Monitor, see the instructions to create a new alert rule.

Machine unavailable

The most basic requirement is to send an alert when a machine is unavailable. It could be stopped, the guest operating system could be unresponsive, or the agent could be unresponsive. There are various ways to configure this alerting, but the most common is to use the heartbeat sent from the Log Analytics agent.

Log query alert rules

Log query alerts use the Heartbeat table, which should have a heartbeat record every minute from each machine.

Use a rule with the following query.

Heartbeat
| summarize TimeGenerated=max(TimeGenerated) by Computer, _ResourceId
| extend Duration = datetime_diff('minute',now(),TimeGenerated)
| summarize AggregatedValue = min(Duration) by Computer, bin(TimeGenerated,5m), _ResourceId
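
A simpler variant, shown here as a sketch rather than a replacement for the query above, returns one row for each machine whose most recent heartbeat is older than five minutes, so it can be used with a result count measurement. The column name LastHeartbeat and the five-minute window are assumptions chosen for illustration.

```kusto
// Sketch: list machines whose latest heartbeat is more than 5 minutes old.
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer, _ResourceId
| where LastHeartbeat < ago(5m)
```

The query period must be longer than five minutes for this to work. A machine that sent no heartbeat at all during the query period doesn't appear in the results.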

Metric alert rules

A metric called Heartbeat is included in each Log Analytics workspace. Each virtual machine connected to that workspace sends a heartbeat metric value each minute. Because the computer is a dimension on the metric, you can fire an alert when any computer fails to send a heartbeat. Set the Aggregation type to Count and the Threshold value to match the Evaluation granularity.

CPU alerts

Metric alert rules

Target | Metric
Host | Percentage CPU
Windows guest | \Processor Information(_Total)\% Processor Time
Linux guest | cpu/usage_active

Log alert rules

CPU utilization

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId

Memory alerts

Metric alert rules

Target | Metric
Windows guest | \Memory\% Committed Bytes in Use
Windows guest | \Memory\Available Bytes
Linux guest | mem/available
Linux guest | mem/available_percent

Log alert rules

Available memory in MB

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Memory" and Name == "AvailableMB"
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId

Available memory in percentage

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Memory" and Name == "AvailableMB"
| extend TotalMemory = toreal(todynamic(Tags)["vm.azm.ms/memorySizeMB"])
| extend AvailableMemoryPercentage = (toreal(Val) / TotalMemory) * 100.0
| summarize AggregatedValue = avg(AvailableMemoryPercentage) by bin(TimeGenerated, 15m), Computer, _ResourceId

Disk alerts

Metric alert rules

Target | Metric
Windows guest | \Logical Disk(_Total)\% Free Space
Windows guest | \Logical Disk(_Total)\Free Megabytes
Linux guest | disk/free
Linux guest | disk/free_percent

Log query alert rules

Logical disk used - all disks on each computer

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "LogicalDisk" and Name == "FreeSpacePercentage"
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId 

Logical disk used - individual disks

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "LogicalDisk" and Name == "FreeSpacePercentage"
| extend Disk=tostring(todynamic(Tags)["vm.azm.ms/mountId"])
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId, Disk 
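
The two queries above read the FreeSpacePercentage metric, so a rule built on them alerts when the value drops below a threshold. If you prefer to express the threshold as used space, one possible sketch is to derive the used percentage before aggregating. The UsedSpacePercentage column name is hypothetical.

```kusto
// Sketch: convert free space to used space so the rule can use a "greater than" threshold.
InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "LogicalDisk" and Name == "FreeSpacePercentage"
| extend Disk = tostring(todynamic(Tags)["vm.azm.ms/mountId"])
| extend UsedSpacePercentage = 100.0 - Val
| summarize AggregatedValue = avg(UsedSpacePercentage) by bin(TimeGenerated, 15m), Computer, _ResourceId, Disk
```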

Logical disk IOPS

InsightsMetrics
| where Origin == "vm.azm.ms" 
| where Namespace == "LogicalDisk" and Name == "TransfersPerSecond"
| extend Disk=tostring(todynamic(Tags)["vm.azm.ms/mountId"])
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId, Disk 

Logical disk data rate

InsightsMetrics
| where Origin == "vm.azm.ms" 
| where Namespace == "LogicalDisk" and Name == "BytesPerSecond"
| extend Disk=tostring(todynamic(Tags)["vm.azm.ms/mountId"])
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId, Disk 

Network alerts

Metric alert rules

Target | Metric
Windows guest | \Network Interface\Bytes Sent/sec

Log query alert rules

Network interfaces bytes received - all interfaces

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Network" and Name == "ReadBytesPerSecond"
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId  

Network interfaces bytes received - individual interfaces

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Network" and Name == "ReadBytesPerSecond"
| extend NetworkInterface=tostring(todynamic(Tags)["vm.azm.ms/networkDeviceId"])
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId, NetworkInterface 

Network interfaces bytes sent - all interfaces

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Network" and Name == "WriteBytesPerSecond"
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId 

Network interfaces bytes sent - individual interfaces

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Network" and Name == "WriteBytesPerSecond"
| extend NetworkInterface=tostring(todynamic(Tags)["vm.azm.ms/networkDeviceId"])
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId, NetworkInterface 

Example log query alert

Here's a walk-through of creating a log alert for when the CPU of a virtual machine exceeds 80 percent. The data you need is in the InsightsMetrics table. The query in step 3 of the following procedure returns the records that need to be evaluated for the alert. Each type of alert rule uses a variant of this query.

Create the log alert rule

  1. In the portal, select the relevant resource. We recommend scaling resources by using subscriptions or resource groups.

  2. In the Resource menu, select Logs.

  3. Use this query to monitor virtual machine CPU usage:

    InsightsMetrics
    | where Origin == "vm.azm.ms"
    | where Namespace == "Processor" and Name == "UtilizationPercentage"
    | summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId
    
  4. Run the query to make sure you get the results you were expecting.

  5. From the top command bar, select + New alert rule to create a rule using the current query.

  6. The Create an alert rule page opens with your query. We try to detect summarized data from the query results automatically. If summarized data is detected, the appropriate values are selected automatically.

    Screenshot of new log alert rule query.

  7. In the Measurement section, select the values for these fields if they are not already automatically selected.

    Field | Description | Value for this scenario
    Measure | The number of table rows or a numeric column to aggregate | AggregatedValue
    Aggregation type | The type of aggregation to apply to the data points in aggregation granularity | Average
    Aggregation granularity | The interval over which data points are grouped by the aggregation type | 15 minutes

    Screenshot of new log alert rule measurement.

  8. In the Split by dimensions section, select the values for these fields if they are not already automatically selected.

    Field | Description | Value for this scenario
    Resource ID column | An Azure Resource ID column that will split the alerts and set the fired alert target scope. | _ResourceId
    Dimension name | Dimensions monitor specific time series and provide context to the fired alert. Dimensions can be either number or string columns. If you select more than one dimension value, each time series that results from the combination will trigger its own alert and will be charged separately. The displayed dimension values are based on data from the last 48 hours. Custom dimension values can be added by clicking 'Add custom value'. | Computer
    Operator | The operator to compare the dimension value | =
    Dimension value | The list of dimension column values | All current and future values

    Screenshot of new log alert rule with dimensions.

  9. In the Alert Logic section, select the values for these fields if they are not already automatically selected.

    Field | Description | Value for this scenario
    Operator | The operator to compare the metric value against the threshold | Greater than
    Threshold value | The value that the result is measured against. | 80
    Frequency of evaluation | How often the alert rule should run. A frequency smaller than the aggregation granularity results in a sliding window evaluation. | 15 minutes

  10. (Optional) In the Advanced options section, set the Number of violations to trigger alert.

    Screenshot of alerts rule preview advanced options.

  11. The Preview chart shows query evaluation results over time. You can change the chart period or select different time series that resulted from unique alert splitting by dimensions.

    Screenshot of alerts rule preview.

  12. From this point on, you can select the Review + create button at any time.

  13. In the Actions tab, select or create the required action groups.

    Screenshot of alerts rule preview actions tab.

  14. In the Details tab, define the Project details and the Alert rule details.

  15. (Optional) In the Advanced options section, you can set several options, including whether to Enable upon creation, or to mute actions for a period after the alert rule fires.

    Screenshot of alerts rule preview details tab.

    Note

    If you or your administrator assigned the Azure Policy Azure Log Search Alerts over Log Analytics workspaces should use customer-managed keys, you must select the Check workspace linked storage option in Advanced options. Otherwise, rule creation fails because it doesn't meet the policy requirements.

  16. In the Tags tab, set any required tags on the alert rule resource.

    Screenshot of alerts rule preview tags tab.

  17. In the Review + create tab, a validation will run and inform you of any issues.

  18. When validation passes and you have reviewed the settings, click the Create button.
    Screenshot of alerts rule preview review and create tab.
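
Before or after creating the rule, it can be useful to check in Logs which machines would currently breach the threshold. The following sketch extends the query from step 3 with a filter on the same 80 percent value used in the Alert Logic section. It's intended for ad hoc inspection only. Don't add the filter to the rule query, because the rule applies the threshold itself.

```kusto
// Sketch: list the 15-minute bins in which a machine's average CPU exceeded 80%.
InsightsMetrics
| where Origin == "vm.azm.ms"
| where Namespace == "Processor" and Name == "UtilizationPercentage"
| summarize AggregatedValue = avg(Val) by bin(TimeGenerated, 15m), Computer, _ResourceId
| where AggregatedValue > 80
| order by TimeGenerated desc
```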

Next steps