Log alerts in Azure Monitor

Overview

Log alerts are one of the alert types that are supported in Azure Alerts. Log alerts allow users to use a Log Analytics query to evaluate resources logs every set frequency, and fire an alert based on the results. Rules can trigger one or more actions using Action Groups.

Note

Log data from a Log Analytics workspace can be sent to the Azure Monitor metrics store. Metrics alerts have different behavior, which may be more desirable depending on the data you are working with. For information on what and how you can route logs to metrics, see Metric Alert for Logs.

Prerequisites

Log alerts run queries on Log Analytics data. First you should start collecting log data and query the log data for issues. You can use the alert query examples article in Log Analytics to understand what you can discover or get started on writing your own query.

Azure Monitoring Contributor is a common role that is needed for creating, modifying, and updating log alerts. Access & query execution rights for the resource logs are also needed. Partial access to resource logs can fail queries or return partial results. Learn more about configuring log alerts in Azure.

Note

Log alerts for Log Analytics used to be managed using the legacy Log Analytics Alert API. Learn more about switching to the current ScheduledQueryRules API.

Query evaluation definition

Log search rules condition definition starts from:

  • What query to run?
  • How to use the results?

The following sections describe the different parameters you can use to set the above logic.

Log query

The Log Analytics query used to evaluate the rule. The results returned by this query are used to determine whether an alert is to be triggered. The query can be scoped to:

  • A specific resource, such as a virtual machine.
  • An at scale resource, such as a subscription or resource group.
  • Multiple resources using cross-resource query.

Important

Alert queries have constraints to ensure optimal performance and the relevance of the results. Learn more here.

Important

Resource centric and cross-resource query are only supported using the current scheduledQueryRules API. If you use the legacy Log Analytics Alert API, you will need to switch. Learn more about switching

Query time Range

Time range is set in the rule condition definition. In workspaces and Application Insights, it's called Period. In all other resource types, it's called Override query time range.

Like in log analytics, the time range limits query data to the specified range. Even if ago command is used in the query, the time range will apply.

For example, a query scans 60 minutes, when time range is 60 minutes, even if the text contains ago(1d). The time range and query time filtering need to match. In the example case, changing the Period / Override query time range to one day, would work as expected.

Measure

Log alerts turn log into numeric values that can be evaluated. You can measure two different things:

Count of the results table rows

Count of results is the default measure. Ideal for working with events such as Windows event logs, syslog, application exceptions. Triggers when log records happen or doesn't happen in the evaluated time window.

Log alerts work best when you try to detect data in the log. It works less well when you try to detect lack of data in the logs. For example, alerting on virtual machine heartbeat.

For workspaces and Application Insights, it's called Based on with selection Number of results. In all other resource types, it's called Measure with selection Table rows.

Note

Since logs are semi-structured data, they are inherently more latent than metric, you may experience misfires when trying to detect lack of data in the logs, and you should consider using metric alerts. You can send data to the metric store from logs using metric alerts for logs.

Example of results table rows count use case

You want to know when your application responded with error code 500 (Internal Server Error). You would create an alert rule with the following details:

  • Query:
requests
| where resultCode == "500"
  • Time period / Aggregation granularity: 15 minutes
  • Alert frequency: 15 minutes
  • Threshold value: Greater than 0

Then alert rules monitors for any requests ending with 500 error code. The query runs every 15 minutes, over the last 15 minutes. If even one record is found, it fires the alert and triggers the actions configured.

Calculation of measure based on a numeric column (such as CPU counter value)

For workspaces and Application Insights, it's called Based on with selection Metric measurement. In all other resource types, it's called Measure with selection of any number column name.

Aggregation type

The calculation that is done on multiple records to aggregate them to one numeric value. For example:

  • Count returns the number of records in the query
  • Average returns the average of the measure column Aggregation granularity defined.

In workspaces and Application Insights, it's supported only in Metric measurement measure type. The query result must contain a column called AggregatedValue that provide a numeric value after a user-defined aggregation. In all other resource types, Aggregation type is selected from the field of that name.

Aggregation granularity

Determines the interval that is used to aggregate multiple records to one numeric value. For example, if you specified 5 minutes, records would be grouped by 5-minute intervals using the Aggregation type specified.

In workspaces and Application Insights, it's supported only in Metric measurement measure type. The query result must contain bin() that sets interval in the query results. In all other resource types, the field that controls this setting is called Aggregation granularity.

Note

As bin() can result in uneven time intervals, the alert service will automatically convert bin() function to bin_at() function with appropriate time at runtime, to ensure results with a fixed point.

Split by alert dimensions

Split alerts by number or string columns into separate alerts by grouping into unique combinations. When creating resource-centric alerts at scale (subscription or resource group scope), you can split by Azure resource ID column. Splitting on Azure resource ID column will change the target of the alert to the specified resource.

Splitting by Azure resource ID column is recommended when you want to monitor the same condition on multiple Azure resources. For example, monitoring all virtual machines for CPU usage over 80%. You may also decide not to split when you want a condition on multiple resources in the scope. Such as monitoring that at least five machines in the resource group scope have CPU usage over 80%.

In workspaces and Application Insights, it's supported only in Metric measurement measure type. The field is called Aggregate On. It's limited to three columns. Having more than three groups by columns in the query could lead to unexpected results. In all other resource types, it's configured in Split by dimensions section of the condition (limited to six splits).

Example of splitting by alert dimensions

For example, you want to monitor errors for multiple virtual machines running your web site/app in a specific resource group. You can do that using a log alert rule as follows:

  • Query:

    // Reported errors
    union Event, Syslog // Event table stores Windows event records, Syslog stores Linux records
    | where EventLevelName == "Error" // EventLevelName is used in the Event (Windows) records
    or SeverityLevel== "err" // SeverityLevel is used in Syslog (Linux) records
    

    When using workspaces and Application Insights with Metric measurement alert logic, this line needs to be added to the query text:

    | summarize AggregatedValue = count() by Computer, bin(TimeGenerated, 15m)
    
  • Resource ID Column: _ResourceId (Splitting by resource ID column in alert rules is only available for subscriptions and resource groups currently)

  • Dimensions / Aggregated on:

    • Computer = VM1, VM2 (Filtering values in alert rules definition isn't available currently for workspaces and Application Insights. Filter in the query text.)
  • Time period / Aggregation granularity: 15 minutes

  • Alert frequency: 15 minutes

  • Threshold value: Greater than 0

This rule monitors if any virtual machine had error events in the last 15 minutes. Each virtual machine is monitored separately and will trigger actions individually.

Note

Split by alert dimensions is only available for the current scheduledQueryRules API. If you use the legacy Log Analytics Alert API, you will need to switch. Learn more about switching. Resource centric alerting at scale is only supported in the API version 2020-08-01 and above.

Alert logic definition

Once you define the query to run and evaluation of the results, you need to define the alerting logic and when to fire actions. The following sections describe the different parameters you can use:

Threshold and operator

The query results are transformed into a number that is compared against the threshold and operator.

Frequency

Note

There are currently no additional charges for 1-minute frequency log alerts preview. Pricing for features that are in preview will be announced in the future and a notice provided prior to start of billing. Should you choose to continue using 1-minute frequency log alerts after the notice period, you will be billed at the applicable rate.

The interval in which the query is run. Can be set from a minute to a day. Must be equal to or less than the query time range to not miss log records.

For example, if you set the time period to 30 minutes and frequency to 1 hour. If the query is run at 00:00, it returns records between 23:30 and 00:00. The next time the query would run is 01:00 that would return records between 00:30 and 01:00. Any records created between 00:00 and 00:30 would never be evaluated.

Number of violations to trigger alert

You can specify the alert evaluation period and the number of failures needed to trigger an alert. Allowing you to better define an impact time to trigger an alert.

For example, if your rule Aggregation granularity is defined as '5 minutes', you can trigger an alert only if three failures (15 minutes) of the last hour occurred. This setting is defined by your application business policy.

State and resolving alerts

Log alerts can either be stateless or stateful (currently in preview).

Stateless alerts fire each time the condition is met, even if fired previously. You can mark the alert as closed once the alert instance is resolved. You can also mute actions to prevent them from triggering for a period after an alert rule fired. In Log Analytics Workspaces and Application Insights, it's called Suppress Alerts. In all other resource types, it's called Mute Actions.

See this alert stateless evaluation example:

Time Log condition evaluation Result
00:05 FALSE Alert doesn't fire. No actions called.
00:10 TRUE Alert fires and action groups called. New alert state ACTIVE.
00:15 TRUE Alert fires and action groups called. New alert state ACTIVE.
00:20 FALSE Alert doesn't fire. No actions called. Pervious alerts state remains ACTIVE.

Stateful alerts fire once per incident and resolve. The alert rule resolves when the alert condition isn't met for 30 minutes for a specific evaluation period (to account for log ingestion delay), and for three consecutive evaluations to reduce noise if there is flapping conditions. For example, with a frequency of 5 minutes, the alert resolve after 40 minutes or with a frequency of 1 minute, the alert resolve after 32 minutes. The resolved notification is sent out via web-hooks or email, the status of the alert instance (called monitor state) in Azure portal is also set to resolved.

Stateful alerts feature is currently in preview in the Azure public cloud. You can set this using Automatically resolve alerts in the alert details section.

Location selection in log alerts

Log alerts allow you to set a location for alert rules. In Log Analytics Workspaces, the rule location must match the workspace location. In all other resources, you can select any of the supported locations, which align to Log Analytics supported region list.

Location affects which region the alert rule is evaluated in. Queries are executed on the log data in the selected region, that said, the alert service end to end is global. Meaning alert rule definition, fired alerts, notifications, and actions aren't bound to the location in the alert rule. Data is transfer from the set region since the Azure Monitor alerts service is a non-regional service.

Pricing and billing of log alerts

Pricing information is located in the Azure Monitor pricing page. Log Alerts are listed under resource provider microsoft.insights/scheduledqueryrules with:

  • Log Alerts on Application Insights shown with exact resource name along with resource group and alert properties.
  • Log Alerts on Log Analytics shown with exact resource name along with resource group and alert properties; when created using scheduledQueryRules API.
  • Log alerts created from legacy Log Analytics API aren't tracked Azure Resources and don't have enforced unique resource names. These alerts are still created on microsoft.insights/scheduledqueryrules as hidden resources, which have this resource naming structure <WorkspaceName>|<savedSearchId>|<scheduleId>|<ActionId>. Log Alerts on legacy API are shown with above hidden resource name along with resource group and alert properties.

Note

Unsupported resource characters such as <, >, %, &, \, ?, / are replaced with _ in the hidden resource names and this will also reflect in the billing information.

Note

Log alerts for Log Analytics used to be managed using the legacy Log Analytics Alert API and legacy templates of Log Analytics saved searches and alerts. Learn more about switching to the current ScheduledQueryRules API. Any alert rule management should be done using legacy Log Analytics API until you decide to switch and you can't use the hidden resources.

Next steps