Understanding alerts in Log Analytics
Alerts in Log Analytics identify important information in your Log Analytics repository. This article provides details of how alert rules in Log Analytics work and describes the differences between different types of alert rules.
For the process of creating alert rules, see the following articles:
- Create alert rules using Azure portal
- Create alert rules using Resource Manager template
- Create alert rules using REST API
Alerts are created by alert rules that automatically run log searches at regular intervals. If the results of the log search match particular criteria then an alert record is created. The rule can then automatically run one or more actions to proactively notify you of the alert or invoke another process. Different types of alert rules use different logic to perform this analysis.
Alert Rules are defined by the following details:
- Log search. The query that runs every time the alert rule fires. The records returned by this query is used to determine whether an alert is created.
- Time window. Specifies the time range for the query. The query returns only records that were created within this range of the current time. This can be any value between 5 minutes and 24 hours. For example, If the time window is set to 60 minutes, and the query is run at 1:15 PM, only records created between 12:15 PM and 1:15 PM is returned.
- Frequency. Specifies how often the query should be run. Can be any value between 5 minutes and 24 hours. Should be equal to or less than the time window. If the value is greater than the time window, then you risk records being missed.
For example, consider a time window of 30 minutes and a frequency of 60 minutes. If the query is run at 1:00, it returns records between 12:30 and 1:00 PM. The next time the query would run is 2:00 when it would return records between 1:30 and 2:00. Any records created between 1:00 and 1:30 would never be evaluated.
- Threshold. The results of the log search are evaluated to determine whether an alert should be created. The threshold is different for the different types of alert rules.
Each alert rule in Log Analytics is one of two types. Each of these types is described in detail in the sections that follow.
- Number of results. Single alert created when the number records returned by the log search exceed a specified number.
- Metric measurement. Alert created for each object in the results of the log search with values that exceed specified threshold.
The differences between alert rule types are as follows.
- Number of results alert rule will always create a single alert while Metric measurement alert rule creates an alert for each object that exceeds the threshold.
- Number of results alert rules create an alert when the threshold is exceeded a single time. Metric measurement alert rules can create an alert when the threshold is exceeded a certain number of times over a particular time interval.
Number of results alert rules
Number of results alert rules create a single alert when the number of records returned by the search query exceed the specified threshold.
The threshold for a Number of results alert rule is simply greater than or less than a particular value. If the number of records returned by the log search match this criteria, then an alert is created.
This type of alert rule is ideal for working with events such as Windows event logs, Syslog, and Custom logs. You may want to create an alert when a particular error event gets created, or when multiple error events are created within a particular time window.
To alert on a single event, set the number of results to greater than 0 and both the frequency and time window to 5 minutes. That runs the query every 5 minutes and check for the occurrence of a single event that was created since the last time the query was run. A longer frequency may delay the time between the event being collected and the alert being created.
Some applications may log an occasional error that shouldn't necessarily raise an alert. For example, the application may retry the process that created the error event and then succeed the next time. In this case, you may not want to create the alert unless multiple events are created within a particular time window.
In some cases, you may want to create an alert in the absence of an event. For example, a process may log regular events to indicate that it's working properly. If it doesn't log one of these events within a particular time window, then an alert should be created. In this case you would set the threshold to less than 1.
Performance data is stored as records in the OMS repository similar to events. If you want to alert when a performance counter exceeds a particular threshold, then that threshold should be included in the query.
For example, if you wanted to alert when the processor runs over 90%, you would use a query like the following with the threshold for the alert rule greater than 0.
Perf | where ObjectName=="Processor" and CounterName=="% Processor Time" and CounterValue>90
If you wanted to alert when the processor averaged over 90% for a particular time window, you would use a query using the measure command like the following with the threshold for the alert rule greater than 0.
Perf | where ObjectName=="Processor" and CounterName=="% Processor Time" | summarize avg(CounterValue) by Computer | where CounterValue>90
If your workspace has not yet been upgraded to the new Log Analytics query language, then the above queries would change to the following:
Type=Perf ObjectName=Processor CounterName="% Processor Time" CounterValue>90
Type=Perf ObjectName=Processor CounterName="% Processor Time" | measure avg(CounterValue) by Computer | where AggregatedValue>90
Metric measurement alert rules
Metric measurement alert rules are currently in public preview.
Metric measurement alert rules create an alert for each object in a query with a value that exceeds a specified threshold. They have the following distinct differences from Number of results alert rules.
While you can use any query for a Number of results alert rule, there are specific requirements the query for a metric measurement alert rule. It must include a Measure command to group the results on a particular field. This command must include the following elements.
- Aggregate function. Determines the calculation that is performed and potentially a numeric field to aggregate. For example, count() will return the number of records in the query, avg(CounterValue) will return the average of the CounterValue field over the interval.
- Group Field. A record with an aggregated value is created for each instance of this field, and an alert can be generated for each. For example, if you wanted to generate an alert for each computer, you would use by Computer.
- Interval. Defines the time interval over which the data is aggregated. For example, if you specified 5minutes, a record would be created for each instance of the group field aggregated at 5 minute intervals over the time window specified for the alert.
The threshold for Metric measurement alert rules is defined by an aggregate value and a number of breaches. If any data point in the log search exceeds this value, it's considered a breach. If the number of breaches in for any object in the results exceeds the specified value, then an alert is created for that object.
Consider a scenario where you wanted an alert if any computer exceeded processor utilization of 90% three times over 30 minutes. You would create an alert rule with the following details.
Query: Perf | where ObjectName == "Processor" and CounterName == "% Processor Time" | summarize AggregatedValue = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
Time window: 30 minutes
Alert frequency: 5 minutes
Aggregate value: Great than 90
Trigger alert based on: Total breaches Greater than 5
The query would create an average value for each computer at 5 minute intervals. This query would be run every 5 minutes for data collected over the previous 30 minutes. Sample data is shown below for three computers.
In this example, separate alerts would be created for srv02 and srv03 since they breached the 90% threshold 3 times over the time window. If the Trigger alert based on: were changed to Consecutive then an alert would be created only for srv03 since it breached the threshold for 3 consecutive samples.
Alert records created by alert rules in Log Analytics have a Type of Alert and a SourceSystem of OMS. They have the properties in the following table.
|Object||Metric measurement alerts will have a property for the group field. For example, if the log search groups on Computer, the alert record with have a Computer field with the name of the computer as the value.|
|AlertName||Name of the alert.|
|AlertSeverity||Severity level of the alert.|
|LinkToSearchResults||Link to Log Analytics log search that returns the records from the query that created the alert.|
|Query||Text of the query that was run.|
|QueryExecutionEndTime||End of the time range for the query.|
|QueryExecutionStartTime||Start of the time range for the query.|
|ThresholdOperator||Operator that was used by the alert rule.|
|ThresholdValue||Value that was used by the alert rule.|
|TimeGenerated||Date and time the alert was created.|
- Install the Alert Management solution to analyze alerts created in Log Analytics along with alerts collected from System Center Operations Manager.
- Read more about log searches that can generate alerts.
- Complete a walkthrough for configuring a webhook with an alert rule.
- Learn how to write runbooks in Azure Automation to remediate problems identified by alerts.