Configure monitoring in VM insights guest health using data collection rules (preview)
VM insights guest health allows you to view the health of a virtual machine as defined by a set of performance measurements that are sampled at regular intervals. This article describes how you can modify default monitoring across multiple virtual machines using data collection rules.
Monitors
The health state of a virtual machine is determined by the rollup of health from each of its monitors. There are two types of monitors in VM insights guest health as shown in the following table.
Monitor | Description |
---|---|
Unit monitor | Measures some aspect of a resource or application. This might be checking a performance counter to determine the performance of the resource, or its availability. |
Aggregate monitor | Groups multiple monitors to provide a single aggregated health state. An aggregate monitor can contain one or more unit monitors and other aggregate monitors. |
The set of monitors used by VM insights guest health and their configuration can't be changed directly. You can, however, create overrides that modify the behavior of the default configuration. Overrides are defined in data collection rules. You can create multiple data collection rules, each containing multiple overrides, to achieve your required monitoring configuration.
Monitor properties
The following table describes the properties that can be configured on each monitor.
Property | Monitors | Description |
---|---|---|
Enabled | Aggregate, Unit | If true, the monitor's state is calculated and contributes to the health of the virtual machine. It can trigger an alert if alerting is enabled. |
Alerting | Aggregate, Unit | If true, an alert is triggered for the monitor when it moves to an unhealthy state. If false, the state of the monitor still contributes to the health of the virtual machine, which could trigger an alert. |
Warning | Unit | Criteria for the warning state. If none, then the monitor will never enter a warning state. |
Critical | Unit | Criteria for the critical state. If none, then the monitor will never enter a critical state. |
Evaluation frequency | Unit | Frequency the health state is evaluated. |
Lookback | Unit | Size of lookback window in seconds. See monitorConfiguration element for detailed description. |
Evaluation Type | Unit | Defines which value to use from the sample set. See monitorConfiguration element for detailed description. |
Min sample | Unit | Minimum number of values to use to calculate value. |
Max sample | Unit | Maximum number of values to use to calculate value. |
Default configuration
The following table lists the default configuration for each monitor. This default configuration can't be directly changed, but you can define overrides that will modify the monitor configuration for certain virtual machines.
Monitor | Enabled | Alerting | Warning | Critical | Evaluation frequency | Lookback | Evaluation type | Min samples | Max samples |
---|---|---|---|---|---|---|---|---|---|
CPU utilization | True | False | None | > 90% | 60 sec | 240 sec | Min | 2 | 3 |
Available memory | True | False | None | < 100 MB | 60 sec | 240 sec | Max | 2 | 3 |
File system | True | False | None | < 100 MB | 60 sec | 120 sec | Max | 1 | 1 |
Overrides
An override changes one or more properties of a monitor. For example, an override could disable a monitor that's enabled by default, define warning criteria for the monitor, or modify the monitor's critical threshold.
Overrides are defined in a Data Collection Rule (DCR). You can create multiple DCRs with different sets of overrides and apply them to multiple virtual machines. You apply a DCR to a virtual machine by creating an association as described in Configure data collection for the Azure Monitor agent (preview).
Multiple overrides
A single monitor may have multiple overrides. If the overrides define different properties, then the resulting configuration is a combination of all the overrides.
For example, the memory|available monitor does not specify a warning threshold or enable alerting by default. Consider the following overrides applied to this monitor:
- Override 1 defines the alertConfiguration.isEnabled property with a value of true.
- Override 2 defines monitorConfiguration.warningCondition with a threshold condition of < 250.
The resulting configuration is a monitor that enters a warning health state and creates a Severity 2 alert when less than 250 MB of memory is available. It enters a critical health state and creates a Severity 1 alert when less than 100 MB of memory is available (or changes an existing alert from Severity 2 to Severity 1 if one already exists).
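The way these two overrides combine can be pictured as a recursive merge. The following Python sketch is illustrative only: the merge function is a hypothetical model rather than the agent's implementation, and DEFAULT_MEMORY_AVAILABLE is an assumed in-memory shape transcribed from the default configuration table.

```python
import copy

# Assumed in-memory shape of the memory|available defaults, transcribed from
# the default configuration table (critical at < 100 MB, no warning, no alerting).
DEFAULT_MEMORY_AVAILABLE = {
    "alertConfiguration": {"isEnabled": False},
    "monitorConfiguration": {
        "criticalCondition": {"isEnabled": True, "operator": "<", "threshold": 100},
    },
}


def merge(base, override):
    """Hypothetical recursive merge: override values replace base values key by key."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Override 1 enables alerting; override 2 adds a warning condition.
override_1 = {"alertConfiguration": {"isEnabled": True}}
override_2 = {
    "monitorConfiguration": {
        "warningCondition": {"isEnabled": True, "operator": "<", "threshold": 250},
    },
}

effective = merge(merge(DEFAULT_MEMORY_AVAILABLE, override_1), override_2)
```

Because the two overrides touch different properties, the order in which they are merged does not change the result in this case.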
If two overrides define the same property on the same monitor, one value takes precedence. Overrides are applied based on their scope, from the most general to the most specific, so an override with a more specific scope is applied after one with a more general scope and is the one most likely to take effect. The order is as follows:
- Global
- Subscription
- Resource group
- Virtual machine
If multiple overrides at the same scope level define the same property on the same monitor, then they are applied in the order they appear in the DCR. If the overrides are in different DCRs, then they are applied in alphabetical order of the DCR resource IDs.
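The ordering rules above can be sketched as a sort followed by sequential application. This Python sketch rests on several assumptions that the article does not confirm: the record layout (level, dcr_id, position, props) is hypothetical, DCR resource IDs are compared case-insensitively, and a later override simply replaces the value set by an earlier one.

```python
# Scope levels from most general to most specific, in the order listed above.
SCOPE_RANK = {"global": 0, "subscription": 1, "resourceGroup": 2, "virtualMachine": 3}


def resolve(overrides):
    """Apply overrides in order; a later override replaces earlier values (assumption).

    Each override is a hypothetical record with these keys:
      level    - one of the SCOPE_RANK keys
      dcr_id   - resource ID of the DCR that contains the override
      position - index of the override within that DCR
      props    - flat dict of property name to value
    """
    ordered = sorted(
        overrides,
        # Scope first, then DCR resource ID (case-insensitive, an assumption),
        # then position within the DCR.
        key=lambda o: (SCOPE_RANK[o["level"]], o["dcr_id"].lower(), o["position"]),
    )
    effective = {}
    for override in ordered:
        effective.update(override["props"])
    return effective


# Hypothetical DCR names, used for illustration only.
overrides = [
    {"level": "virtualMachine", "dcr_id": "dcr-b", "position": 0,
     "props": {"warningThreshold": 300}},
    {"level": "subscription", "dcr_id": "dcr-a", "position": 0,
     "props": {"warningThreshold": 250, "alerting": True}},
]
effective = resolve(overrides)
```

In this example the virtual machine override is applied last, so its warning threshold replaces the subscription-level value while the subscription-level alerting setting is retained.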
Data collection rule configuration
The JSON elements in the data collection rule that define overrides are described in the following sections. A complete example is provided in Sample data collection rule.
extensions structure
Guest health is implemented as an extension to the Azure Monitor agent, so overrides are defined in the extensions element of the data collection rule.
"extensions": [
    {
        "name": "Microsoft-VMInsights-Health",
        "streams": [
            "Microsoft-HealthStateChange"
        ],
        "extensionName": "HealthExtension",
        "extensionSettings": { }
    }
]
Element | Required | Description |
---|---|---|
name | Yes | User-defined string for the extension. |
streams | Yes | List of streams that guest health data will be sent to. This must include Microsoft-HealthStateChange. |
extensionName | Yes | Name of the extension. This must be HealthExtension. |
extensionSettings | Yes | Settings for the extension, including the healthRuleOverrides array to be applied to the default configuration. |
extensionSettings element
Contains settings for the extension.
"extensionSettings": {
    "schemaVersion": "1.0",
    "contentVersion": "",
    "healthRuleOverrides": [ ]
}
Element | Required | Description |
---|---|---|
schemaVersion | Yes | String defined by Microsoft to represent the expected schema of the element. Currently must be set to 1.0. |
contentVersion | No | String defined by the user to track different versions of the health configuration, if required. |
healthRuleOverrides | Yes | Array of healthRuleOverride elements to be applied to the default configuration. |
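If you generate data collection rules from a script, the extension entry can be assembled programmatically. The following Python sketch builds the structure described in the two tables above; build_health_extension is a hypothetical helper name, and the content version string is an arbitrary example value.

```python
import json


def build_health_extension(overrides, content_version=""):
    """Return one entry for the "extensions" array of a data collection rule.

    Hypothetical helper; only the element names and required values come
    from the tables above.
    """
    return {
        "name": "Microsoft-VMInsights-Health",       # user-defined string
        "streams": ["Microsoft-HealthStateChange"],  # must include this stream
        "extensionName": "HealthExtension",          # must be HealthExtension
        "extensionSettings": {
            "schemaVersion": "1.0",                  # currently must be 1.0
            "contentVersion": content_version,       # optional, user-defined
            "healthRuleOverrides": list(overrides),
        },
    }


extension = build_health_extension([], content_version="example-1")
print(json.dumps({"extensions": [extension]}, indent=2))
```

The printed JSON is only the extensions fragment. A full data collection rule contains other elements that are outside the scope of this sketch.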
healthRuleOverrides element
Contains one or more healthRuleOverride elements that each define an override.
"healthRuleOverrides": [
    {
        "scopes": [ ],
        "monitors": [ ],
        "monitorConfiguration": { },
        "alertConfiguration": { },
        "isEnabled": true|false
    }
]
Element | Required | Description |
---|---|---|
scopes | Yes | List of one or more scopes that specify the virtual machines to which this override is applicable. Even though the DCR is associated with a virtual machine, the virtual machine must fall within a scope for the override to be applied. |
monitors | Yes | List of one or more strings that define which monitors will receive this override. |
monitorConfiguration | No | Configuration for the monitor, including health states and how they are calculated. |
alertConfiguration | No | Alerting configuration for the monitor. |
isEnabled | No | Controls whether the monitor is enabled. A disabled monitor switches to a special Disabled health state and stays disabled unless re-enabled. If omitted, the monitor inherits its status from its parent monitor in the hierarchy. |
scopes element
Each override has one or more scopes that define which virtual machines the override applies to. The scope can be a subscription, resource group, or a single virtual machine. Even if the override is in a DCR associated with a particular virtual machine, it's only applied to that virtual machine if the virtual machine is within one of the scopes of the override. This allows you to broadly associate a small number of DCRs with a set of VMs while keeping granular control over the assignment of each override within the DCR itself. For example, you might create a small set of DCRs and associate them with a set of virtual machines using policy, and then use the scopes element to specify health monitor overrides for different subsets of those virtual machines.
The following table shows examples of different scopes.
Scope | Example |
---|---|
Single virtual machine | /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-name/providers/Microsoft.Compute/virtualMachines/my-vm |
All virtual machines in a resource group | /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-name |
All virtual machines in a subscription | /subscriptions/00000000-0000-0000-0000-000000000000/ |
All virtual machines the data collection rule is associated with | * |
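One way to reason about whether a virtual machine falls within a scope is as a prefix match on resource IDs. The Python sketch below is a simplified model, not the agent's documented behavior: it assumes case-insensitive comparison and matching on whole path segments, and in_scope is a hypothetical function name.

```python
def in_scope(vm_resource_id, scopes):
    """Return True if the VM resource ID falls under any of the given scopes.

    Assumptions: comparison is case-insensitive, a scope matches only on
    whole path segments, and "*" matches every associated virtual machine.
    """
    vm = vm_resource_id.rstrip("/").lower()
    for scope in scopes:
        if scope == "*":
            return True
        prefix = scope.rstrip("/").lower()
        if vm == prefix or vm.startswith(prefix + "/"):
            return True
    return False


SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"
VM = SUB + "/resourceGroups/rg-name/providers/Microsoft.Compute/virtualMachines/my-vm"

print(in_scope(VM, [SUB + "/resourceGroups/rg-name"]))    # resource group scope
print(in_scope(VM, [SUB + "/resourceGroups/other-rg"]))   # different resource group
```

Matching on whole segments keeps a scope for resource group rg from accidentally covering a virtual machine in rg-name.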
monitors element
List of one or more strings that define which monitors in the health hierarchy will receive this override. Each element can be a monitor name, or a type name that matches one or more monitors.
"monitors": [
    "<monitor name>"
]
The following table lists the currently available monitor names.
Type name | Name | Description |
---|---|---|
root | root | Top-level monitor representing virtual machine health. |
cpu-utilization | cpu-utilization | CPU utilization monitor. |
logical-disks | logical-disks | Aggregate monitor for the health state of all monitored disks on a Windows virtual machine. |
logical-disks|* | logical-disks|C:, logical-disks|D: | Aggregate monitor tracking the health of a given disk on a Windows virtual machine. |
logical-disks|*|free-space | logical-disks|C:|free-space, logical-disks|D:|free-space | Disk free space monitor on a Windows virtual machine. |
filesystems | filesystems | Aggregate monitor for the health of all filesystems on a Linux virtual machine. |
filesystems|* | filesystems|/, filesystems|/mnt, filesystems|/var/log | Aggregate monitor tracking the health of a filesystem on a Linux virtual machine. |
filesystems|*|free-space | filesystems|/|free-space, filesystems|/mnt|free-space | Disk free space monitor on a Linux virtual machine filesystem. |
memory | memory | Aggregate monitor for the health of virtual machine memory. |
memory|available | memory|available | Monitor tracking available memory on the virtual machine. |
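The relationship between type names and monitor names in the table can be modeled as a segment-by-segment comparison in which * stands for exactly one segment. The Python sketch below is an assumption about the matching semantics, offered to make the table easier to read. It is not a description of how the agent resolves names, and matches is a hypothetical function.

```python
def matches(type_name, monitor_name):
    """Segment-wise match in which "*" in the type name matches any one segment.

    Assumed semantics: names are split on "|" and must have the same
    number of segments.
    """
    pattern = type_name.split("|")
    name = monitor_name.split("|")
    if len(pattern) != len(name):
        return False
    return all(p == "*" or p == n for p, n in zip(pattern, name))


print(matches("logical-disks|*|free-space", "logical-disks|C:|free-space"))
print(matches("filesystems|*", "filesystems|/var/log"))
```

Under this model, logical-disks|* covers the per-disk aggregate monitors but not the free-space unit monitors beneath them, which have a third segment.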
alertConfiguration element
Specifies whether an alert should be created from the monitor.
"alertConfiguration": {
    "isEnabled": true|false
}
Element | Mandatory | Description |
---|---|---|
isEnabled | No | If set to true, the monitor generates an alert when switching to a critical or warning state and resolves the alert when returning to a healthy state. If false or omitted, no alert is generated. |
monitorConfiguration element
Applies only to unit monitors. Defines the configuration for the monitor including health states and how they are calculated.
The parameters define the algorithm used to calculate the metric value that is compared against the thresholds. Instead of acting on a single sample of data from the underlying metric, the monitor evaluates the metric samples received within the window between the evaluation time and lookbackSecs seconds earlier. All samples received within that timeframe are considered. If the count of samples is greater than maxSamples, the oldest samples beyond maxSamples are ignored.
If there are fewer samples in the lookback interval than minSamples, the monitor switches to the Unknown health state, indicating that there isn't enough data to make an informed decision about the health of the underlying metric. If at least minSamples samples are available, the aggregation function specified by the evaluationType parameter is run over the set to calculate a single value.
"monitorConfiguration": {
    "evaluationType": "<type-of-evaluation>",
    "lookbackSecs": <lookback-number-of-seconds>,
    "evaluationFrequencySecs": <evaluation-frequency-number-of-seconds>,
    "minSamples": <minimum-samples>,
    "maxSamples": <maximum-samples>,
    "warningCondition": { },
    "criticalCondition": { }
}
Element | Mandatory | Description |
---|---|---|
evaluationFrequencySecs | No | Defines the frequency of health state evaluation. Each monitor is evaluated when the agent starts and at the regular interval defined by this parameter thereafter. |
lookbackSecs | No | Size of the lookback window in seconds. |
evaluationType | No | Defines which value to use from the sample set. min: take the minimum value from the entire sample set. max: take the maximum value from the entire sample set. avg: take the average of the sample set values. all: compare every value in the set to the thresholds; the monitor switches state if and only if all samples in the set satisfy the threshold condition. |
minSamples | No | Minimum number of values to use to calculate the value. |
maxSamples | No | Maximum number of values to use to calculate the value. |
warningCondition | No | Threshold and comparison logic for the warning condition. |
criticalCondition | No | Threshold and comparison logic for the critical condition. |
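The sampling behavior described in this section can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the agent's implementation: samples are represented as (timestamp in seconds, value) tuples, both ends of the lookback window are treated as inclusive, and None stands in for the Unknown health state. The all evaluation type is left out because it compares each sample to the thresholds instead of producing a single value.

```python
from statistics import mean


def select_samples(samples, now, lookback_secs, max_samples):
    """Return the values of the newest max_samples samples in the lookback window.

    samples is a list of (timestamp_secs, value) tuples. Window bounds are
    treated as inclusive, which is an assumption.
    """
    in_window = [s for s in samples if now - lookback_secs <= s[0] <= now]
    in_window.sort(key=lambda s: s[0], reverse=True)  # newest first
    return [value for _, value in in_window[:max_samples]]


def aggregate(values, evaluation_type, min_samples):
    """Return a single value, or None when there are too few samples (Unknown)."""
    if not values or len(values) < min_samples:
        return None
    if evaluation_type == "min":
        return min(values)
    if evaluation_type == "max":
        return max(values)
    if evaluation_type == "avg":
        return mean(values)
    raise ValueError("unsupported evaluation type: " + evaluation_type)


# CPU utilization defaults: 240 sec lookback, min evaluation, 2 to 3 samples.
samples = [(100, 95.0), (160, 97.0), (220, 92.0), (280, 99.0)]
values = select_samples(samples, now=300, lookback_secs=240, max_samples=3)
cpu_value = aggregate(values, "min", min_samples=2)
```

Here the three newest samples are kept and their minimum is 92.0. With the min evaluation type and a critical condition of > 90%, the monitor turns critical only when every retained sample exceeds the threshold, which filters out short spikes.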
warningCondition element
Defines threshold and comparison logic for the warning condition. If this element isn't included, then the monitor will never switch to the warning health state.
"warningCondition": {
    "isEnabled": true|false,
    "operator": "<comparison-operator>",
    "threshold": <threshold-value>
}
Property | Mandatory | Description |
---|---|---|
isEnabled | No | Specifies whether the condition is enabled. If set to false, the condition is disabled even though the threshold and operator properties may be set. |
threshold | No | Defines the threshold to compare against the evaluated value. |
operator | No | Defines the comparison operator to use in the threshold expression. Possible values: >, <, >=, <=, ==. |
criticalCondition element
Defines threshold and comparison logic for the critical condition. If this element isn't included, then the monitor will never switch to the critical health state.
"criticalCondition": {
    "isEnabled": true|false,
    "operator": "<comparison-operator>",
    "threshold": <threshold-value>
}
Property | Mandatory | Description |
---|---|---|
isEnabled | No | Specifies whether the condition is enabled. If set to false, the condition is disabled even though the threshold and operator properties may be set. |
threshold | No | Defines the threshold to compare against the evaluated value. |
operator | No | Defines the comparison operator to use in the threshold expression. Possible values: >, <, >=, <=, ==. |
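To illustrate how the warning and critical conditions might combine into a health state, the following Python sketch maps an evaluated value to a state. It makes assumptions the article does not state: a condition with isEnabled omitted is treated as enabled, the critical condition is checked before the warning condition, and a value of None represents too few samples (Unknown). The function names are hypothetical.

```python
import operator

# Comparison operators listed in the tables above.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def condition_met(value, condition):
    """Return True if an enabled condition holds for the evaluated value.

    A missing condition never matches. An omitted isEnabled is treated
    as true, which is an assumption.
    """
    if not condition or not condition.get("isEnabled", True):
        return False
    compare = OPERATORS[condition["operator"]]
    return compare(value, condition["threshold"])


def health_state(value, warning=None, critical=None):
    """Map an evaluated value to a health state, checking critical first (assumption)."""
    if value is None:
        return "unknown"
    if condition_met(value, critical):
        return "critical"
    if condition_met(value, warning):
        return "warning"
    return "healthy"


# Available memory in MB, with the example thresholds used earlier in this article.
warning = {"isEnabled": True, "operator": "<", "threshold": 250}
critical = {"isEnabled": True, "operator": "<", "threshold": 100}
state = health_state(180, warning=warning, critical=critical)
```

With 180 MB available, only the warning condition holds, so the sketch reports a warning state. At 80 MB both conditions hold and the critical state wins.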
Sample data collection rule
For a sample data collection rule enabling guest monitoring, see Enable a virtual machine using Resource Manager template.
Next steps
- Read more about data collection rules.