Understand how metric alerts work in Azure Monitor

Metric alerts in Azure Monitor work on top of multi-dimensional metrics. These metrics can be platform metrics, custom metrics, popular logs from Azure Monitor converted to metrics, and Application Insights metrics. Metric alerts evaluate at regular intervals to check whether conditions on one or more metric time series are true, and notify you when the conditions are met. Metric alerts are stateful: they send notifications only when the state changes.

How do metric alerts work?

You can define a metric alert rule by specifying a target resource to monitor, a metric name, a condition type (static or dynamic), the condition (an operator and a threshold or sensitivity), and an action group to trigger when the alert rule fires. The condition type affects how thresholds are determined. Learn more about the Dynamic Thresholds condition type and sensitivity options.

Alert rule with static condition type

Let's say you have created a simple static threshold metric alert rule as follows:

  • Target Resource (the Azure resource you want to monitor): myVM
  • Metric: Percentage CPU
  • Condition Type: Static
  • Time Aggregation (the statistic that is run over raw metric values; supported time aggregations are Min, Max, Avg, Total, Count): Average
  • Period (the look-back window over which metric values are checked): Over the last 5 mins
  • Frequency (how often the metric alert checks whether the conditions are met): 1 min
  • Operator: Greater Than
  • Threshold: 70

From the time the alert rule is created, the monitor runs every 1 minute, looks at metric values for the last 5 minutes, and checks whether the average of those values exceeds 70. If the condition is met, that is, the average Percentage CPU for the last 5 minutes exceeds 70, the alert rule fires an activated notification. If you have configured an email or a webhook action in the action group associated with the alert rule, you will receive an activated notification on both.
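The static evaluation described above can be sketched in a few lines of Python. This is only an illustration, not how Azure Monitor is implemented: the function name `static_condition_met` and the sample values are hypothetical, and it assumes one metric sample per minute, ordered oldest first.

```python
from statistics import mean

def static_condition_met(samples, threshold=70.0, period=5):
    """Return True if the average of the last `period` samples exceeds `threshold`.

    Assumption: `samples` holds one Percentage CPU value per minute, oldest first.
    """
    window = samples[-period:]
    if not window:
        return False  # no data in the look-back window, nothing to evaluate
    return mean(window) > threshold  # operator: Greater Than

# Hypothetical per-minute Percentage CPU values for myVM.
cpu = [40, 55, 65, 72, 80, 85, 90]

# Frequency of 1 min: re-evaluate each time a new sample arrives.
for minute in range(5, len(cpu) + 1):
    seen = cpu[:minute]
    print(minute, static_condition_met(seen))
```

Because the operator is Greater Than, an average of exactly 70 does not meet the condition in this sketch.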

When you use multiple conditions in one rule, the rule "ands" the conditions together. That is, the alert fires when all the conditions in the alert rule evaluate as true, and resolves when one of the conditions is no longer true. An example of this type of alert rule would be to monitor an Azure virtual machine and alert when both "Percentage CPU is higher than 90%" and "Queue length is over 300 items".
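As a rough sketch of the "and" behavior, the rule below fires only when every condition holds for the already-aggregated metric values. The dictionary layout, the `OPERATORS` table, and the function name `rule_fires` are assumptions made for illustration; they are not an Azure Monitor API.

```python
import operator

# Hypothetical mapping from operator names to Python comparisons.
OPERATORS = {
    "GreaterThan": operator.gt,
    "LessThan": operator.lt,
}

def rule_fires(conditions, aggregated):
    """Return True only if all conditions evaluate as true (logical AND)."""
    return all(
        OPERATORS[c["operator"]](aggregated[c["metric"]], c["threshold"])
        for c in conditions
    )

conditions = [
    {"metric": "Percentage CPU", "operator": "GreaterThan", "threshold": 90},
    {"metric": "Queue length", "operator": "GreaterThan", "threshold": 300},
]

print(rule_fires(conditions, {"Percentage CPU": 95, "Queue length": 350}))
print(rule_fires(conditions, {"Percentage CPU": 95, "Queue length": 250}))
```

The second call returns False because one condition is no longer true, which mirrors how the alert resolves.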

Alert rule with dynamic condition type

Let's say you have created a simple Dynamic Thresholds metric alert rule as follows:

  • Target Resource (the Azure resource you want to monitor): myVM
  • Metric: Percentage CPU
  • Condition Type: Dynamic
  • Time Aggregation (the statistic that is run over raw metric values; supported time aggregations are Min, Max, Avg, Total, Count): Average
  • Period (the look-back window over which metric values are checked): Over the last 5 mins
  • Frequency (how often the metric alert checks whether the conditions are met): 1 min
  • Operator: Greater Than
  • Sensitivity: Medium
  • Look Back Periods: 4
  • Number of Violations: 4

Once the alert rule is created, the Dynamic Thresholds machine learning algorithm acquires the available historical data, calculates the threshold that best fits the metric series' behavior pattern, and continuously learns from new data to make the threshold more accurate.

From the time the alert rule is created, the monitor runs every 1 minute, looks at metric values in the last 20 minutes grouped into 5-minute periods, and checks whether the average of the values in each of the 4 periods exceeds the expected threshold. If the condition is met, that is, the average Percentage CPU in the last 20 minutes (four 5-minute periods) deviated from expected behavior four times, the alert rule fires an activated notification. If you have configured an email or a webhook action in the action group associated with the alert rule, you will receive an activated notification on both.
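The look-back and violation counting described above can be sketched as follows. The machine learning model that produces the thresholds is not reproduced here: `upper_bounds` is a hypothetical list of expected upper thresholds, one per period, supplied by hand. The function names are also assumptions, and the sketch assumes one sample per minute, oldest first.

```python
from statistics import mean

def count_violations(samples, upper_bounds, period=5):
    """Count look-back periods whose average exceeds the expected upper bound.

    `upper_bounds` has one entry per look-back period, oldest period first.
    """
    lookback = len(upper_bounds)
    needed = period * lookback
    if len(samples) < needed:
        return 0  # not enough data to fill every look-back period
    recent = samples[-needed:]
    violations = 0
    for i, bound in enumerate(upper_bounds):
        chunk = recent[i * period:(i + 1) * period]
        if mean(chunk) > bound:
            violations += 1
    return violations

def dynamic_condition_met(samples, upper_bounds, required_violations, period=5):
    """Fire when at least `required_violations` periods deviate upward."""
    return count_violations(samples, upper_bounds, period) >= required_violations

# 20 minutes of hypothetical data: four 5-minute periods averaging 50, 60, 70, 80.
cpu = [50] * 5 + [60] * 5 + [70] * 5 + [80] * 5
print(dynamic_condition_met(cpu, [45, 55, 65, 75], required_violations=4))
```

With Look Back Periods set to 4 and Number of Violations set to 4, all four periods must exceed their expected threshold before the rule fires.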

Viewing and resolving fired alerts

The above examples of alert rules firing can also be viewed in the Azure portal in the All Alerts blade.

Say the usage on "myVM" continues to be above the threshold in subsequent checks. The alert rule will not fire again until the condition is resolved.

After some time, the usage on "myVM" comes back down to normal (goes below the threshold). The alert rule monitors the condition two more times before sending out a resolved notification. That is, the alert rule sends a resolved/deactivated message when the alert condition is not met for three consecutive periods, which reduces noise in case of flapping conditions.

When the resolved notification is sent out via webhooks or email, the status of the alert instance (called monitor state) in the Azure portal is also set to resolved.
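The stateful behavior above (fire once, stay quiet while the condition persists, resolve after three consecutive healthy evaluations) can be modeled as a small state machine. The class name `AlertState` and the returned strings are hypothetical; this is a sketch of the described behavior, not Azure Monitor's implementation.

```python
class AlertState:
    """Tracks one alert instance and emits a notification only on state changes."""

    def __init__(self, healthy_needed=3):
        self.fired = False
        self.healthy_streak = 0
        self.healthy_needed = healthy_needed  # consecutive healthy checks to resolve

    def evaluate(self, condition_met):
        """Return "Activated", "Resolved", or None when no notification is sent."""
        if condition_met:
            self.healthy_streak = 0  # flapping resets the resolution countdown
            if not self.fired:
                self.fired = True
                return "Activated"
            return None  # already fired, so no repeat notification
        if self.fired:
            self.healthy_streak += 1
            if self.healthy_streak >= self.healthy_needed:
                self.fired = False
                self.healthy_streak = 0
                return "Resolved"
        return None

state = AlertState()
checks = [True, True, False, False, True, False, False, False]
print([state.evaluate(c) for c in checks])
```

In this run the brief recovery in the middle does not produce a resolved notification, because the condition is met again before three consecutive healthy evaluations have passed.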

Using dimensions

Metric alerts in Azure Monitor also support monitoring multiple dimension value combinations with one rule. Let's see why you might use multiple dimension combinations with the help of an example.

Say you have an App Service plan for your website. You want to monitor CPU usage on multiple instances running your website or app. You can do that using a metric alert rule as follows:

  • Target resource: myAppServicePlan
  • Metric: Percentage CPU
  • Condition Type: Static
  • Dimensions
    • Instance = InstanceName1, InstanceName2
  • Time Aggregation: Average
  • Period: Over the last 5 mins
  • Frequency: 1 min
  • Operator: GreaterThan
  • Threshold: 70

Like before, this rule monitors whether the average CPU usage for the last 5 minutes exceeds 70%. However, with the same rule you can monitor two instances running your website. Each instance is monitored individually, and you get notifications individually.

Say you have a web app that is seeing massive demand and you need to add more instances. The above rule still monitors just two instances. However, you can create a rule as follows:

  • Target resource: myAppServicePlan
  • Metric: Percentage CPU
  • Condition Type: Static
  • Dimensions
    • Instance = *
  • Time Aggregation: Average
  • Period: Over the last 5 mins
  • Frequency: 1 min
  • Operator: GreaterThan
  • Threshold: 70

This rule automatically monitors all values of the Instance dimension; that is, you can monitor your instances as they come up without needing to modify your metric alert rule again.
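The difference between listing dimension values and using the wildcard can be sketched like this. The function `evaluate_by_dimension`, the data layout, and the third instance name are assumptions for illustration only; each instance's time series is evaluated on its own, as the text describes.

```python
from statistics import mean

def evaluate_by_dimension(series_by_instance, selected, threshold=70.0, period=5):
    """Evaluate each selected instance separately and return {instance: fired}.

    `selected` is either "*" (every instance currently reporting) or a list of names.
    """
    if selected == "*":
        names = list(series_by_instance)
    else:
        names = [n for n in selected if n in series_by_instance]
    results = {}
    for name in names:
        window = series_by_instance[name][-period:]
        if window:  # skip instances that have not emitted any data yet
            results[name] = mean(window) > threshold
    return results

# Hypothetical per-minute Percentage CPU values; InstanceName3 was added later.
series = {
    "InstanceName1": [80, 82, 85, 81, 79],
    "InstanceName2": [50, 48, 52, 55, 51],
    "InstanceName3": [90, 92, 95, 91, 93],
}

print(evaluate_by_dimension(series, ["InstanceName1", "InstanceName2"]))
print(evaluate_by_dimension(series, "*"))
```

With the explicit list, the newly added instance is ignored; with "*", it is picked up without changing the rule.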

When monitoring multiple dimensions, a Dynamic Thresholds alert rule can create tailored thresholds for hundreds of metric series at a time. Dynamic Thresholds results in fewer alert rules to manage and significant time savings in creating and managing alert rules.

Say you have a web app with many instances and you don't know what the most suitable threshold is. The above rules always use a threshold of 70%. However, you can create a rule as follows:

  • Target resource: myAppServicePlan
  • Metric: Percentage CPU
  • Condition Type: Dynamic
  • Dimensions
    • Instance = *
  • Time Aggregation: Average
  • Period: Over the last 5 mins
  • Frequency: 1 min
  • Operator: GreaterThan
  • Sensitivity: Medium
  • Look Back Periods: 1
  • Number of Violations: 1

This rule monitors whether the average CPU usage for the last 5 minutes exceeds the expected behavior for each instance. With the same rule, you can monitor instances as they come up without needing to modify your metric alert rule again. Each instance gets a threshold that fits its metric series' behavior pattern, and the threshold continuously changes based on new data to become more accurate. Like before, each instance is monitored individually and you get notifications individually.

Increasing the look-back periods and number of violations also lets you filter alerts so that they fire only on your definition of a significant deviation. Learn more about Dynamic Thresholds advanced options.

Note

We recommend choosing an Aggregation granularity (Period) that is larger than the Frequency of evaluation, to reduce the likelihood of missing the first evaluation of added time series in the following cases:

  • Metric alert rule that monitors multiple dimensions: when a new dimension value combination is added
  • Metric alert rule that monitors multiple resources: when a new resource is added to the scope
  • Metric alert rule that monitors a metric that isn't emitted continuously (sparse metric): when the metric is emitted after a period longer than 24 hours in which it wasn't emitted

Monitoring at scale using metric alerts in Azure Monitor

So far, you have seen how a single metric alert can be used to monitor one or many metric time series related to a single Azure resource. Often, you might want the same alert rule applied to many resources. Azure Monitor also supports monitoring multiple resources (of the same type) with one metric alert rule, for resources that exist in the same Azure region.

This feature is currently supported for platform metrics (not custom metrics) for the following services in the following Azure clouds:

Service                        Public Azure   Government   China
Virtual machines [1]           Yes            No           No
SQL server databases           Yes            Yes          Yes
SQL server elastic pools       Yes            Yes          Yes
NetApp files capacity pools    Yes            Yes          Yes
NetApp files volumes           Yes            Yes          Yes
Key vaults                     Yes            Yes          Yes
Azure Cache for Redis          Yes            Yes          Yes
Data box edge devices          Yes            Yes          Yes

[1] Not supported for virtual machine network metrics (Network In Total, Network Out Total, Inbound Flows, Outbound Flows, Inbound Flows Maximum Creation Rate, Outbound Flows Maximum Creation Rate).

You can specify the scope of monitoring for a single metric alert rule in one of three ways. For example, with virtual machines you can specify the scope as:

  • a list of virtual machines (in one Azure region) within a subscription
  • all virtual machines (in one Azure region) in one or more resource groups in a subscription
  • all virtual machines (in one Azure region) in a subscription

Note

The scope of a multi-resource metric alert rule must contain at least one resource of the selected resource type.
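To make the three scope options concrete, the sketch below builds the corresponding Azure resource ID strings. The helper names, the placeholder subscription ID, and the resource group and VM names are hypothetical; only the general resource ID layout is taken from Azure Resource Manager conventions, and how a given tool accepts these scopes is not shown here.

```python
def subscription_scope(subscription_id):
    """Scope covering all resources of the chosen type in a subscription."""
    return f"/subscriptions/{subscription_id}"

def resource_group_scope(subscription_id, resource_group):
    """Scope covering all resources of the chosen type in one resource group."""
    return f"{subscription_scope(subscription_id)}/resourceGroups/{resource_group}"

def vm_scope(subscription_id, resource_group, vm_name):
    """Scope for a single virtual machine."""
    return (
        f"{resource_group_scope(subscription_id, resource_group)}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )

SUB = "00000000-0000-0000-0000-000000000000"  # placeholder subscription ID

# Option 1: an explicit list of virtual machines.
vm_list = [vm_scope(SUB, "myResourceGroup", name) for name in ("myVM1", "myVM2")]
# Option 2: all virtual machines in one or more resource groups.
rg_list = [resource_group_scope(SUB, rg) for rg in ("myResourceGroup", "myOtherGroup")]
# Option 3: all virtual machines in the subscription.
sub_list = [subscription_scope(SUB)]

for scope in vm_list + rg_list + sub_list:
    print(scope)
```

In all three cases the rule still targets one resource type in one Azure region; the scope only controls which resources of that type are included.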

Creating a metric alert rule that monitors multiple resources is like creating any other metric alert that monitors a single resource. The only difference is that you select all the resources you want to monitor. You can also create these rules through Azure Resource Manager templates. You will receive individual notifications for each monitored resource.

Note

In a metric alert rule that monitors multiple resources, only one condition is allowed.

Typical latency

For metric alerts, you will typically get notified in under 5 minutes if you set the alert rule frequency to 1 minute. When notification systems are under heavy load, you might see a longer latency.

Supported resource types for metric alerts

You can find the full list of supported resource types in this article.

Next steps