Monitoring and Alerting in Azure
The following features are provided to monitor Azure resources:
- Azure Status – portal showing status of all Azure services
- Service Health – information on health of Azure services
- Resource Health – information on health of Azure resources
- Audit Logs – outcome of management operations
- Metric Alerts – metric alerts created for Azure resources
- Azure Insights – consumption of monitoring information
- Log Analytics – searchability of monitoring information
- Security Center – results of a security analysis of virtual machines
The Azure Status web site provides up-to-date information on the current status of Azure services. It comprises two pages:
- Current Status – information on the current state of all Azure services in all public Azure regions
- History – impact summary for significant Azure service outages
Azure Service Health provides information about service interruptions and performance degradations that may impact the Azure services used in a subscription. The information is about the service in general and is not specific to its use by individual resources in a subscription. Azure Service Health provides the history of changes to the incident status.
The Service Health information is accessed through the Service Health blade in the Production Portal, where it is available in summary or detail form and can be filtered by timespan. It can also be displayed by setting an Event Category filter of Service Health in the Audit Logs blade in the Production Portal. Stephen Siciliano documents how to do this here.
The Service Health information can be accessed using the Get-AzureRmLog command as in the following example:
Get-AzureRmLog -ResourceProvider Azure.Health -StartTime '01/01/2016' -EndTime '01/12/2016' -DetailedOutput
The Azure Resource Manager Insights API can be used to configure the automatic emailing of Azure Service Health alerts when a service health incident occurs. Matt Loflin has written a blog post showing how to set up email alerts for Azure Service Health using the preview Azure Insights .NET API. It is not currently possible to configure these alerts in the Production Portal or directly in either Azure PowerShell or the Azure CLI. However, it is possible to use this ARM Template to “deploy” an alert.
Azure Resource Health provides information about the current health of a resource. Bernardo Munoz introduces Azure Resource Health in this post.
Currently, Resource Health provides information for the following Azure services:
- Virtual Machine (classic)
- Web App
- SQL Database
The Resource Health blade in the Production Portal lists all the resources for which resource health information is available. Additional information is then accessible for each resource, along with the opportunity to perform additional health checks and troubleshooting.
This information can also be found using the Check health link on the Settings property for an individual resource.
There is a preview Azure Resource Manager API that allows the retrieval of the resource health reports for a subscription, resource group or individual resource. The information retrieved is the same as that displayed in the Production Portal.
Audit Logs provide status on every management operation performed against an Azure resource.
The Audit Logs blade in the Production Portal lists all the operations in the past week. Additional summary and detailed information is then accessible for each operation.
The primary Audit Logs blade has a filter that can be used to filter the summary list by:
- Resource group
- Resource type – e.g., Microsoft.Compute/virtualMachines
- Resource Level – critical, error, warning, informational
- Time span
- Caller – i.e., user who invoked the operation
Audit Logs are also accessible from Azure PowerShell. The Get-AzureRmLog command can be used to retrieve the same summary and detailed information provided on the Production Portal. The command has various parameters allowing the returned information to be filtered. Note that Azure PowerShell 0.x has three separate commands, one each for subscription, resource group and resource provider.
For example, the following Azure PowerShell command retrieves detailed Audit Log events in the specified time interval for the Microsoft.Compute resource provider:
Get-AzureRmLog -ResourceProvider Microsoft.Compute -StartTime '1/1/2016' -EndTime '1/5/2016' -DetailedOutput
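The same kind of query can, in principle, be issued directly against the Azure Insights REST API. The following Python sketch only builds the request URL and sends nothing. The endpoint path, the api-version value 2015-04-01 and the filter field names are assumptions based on the Insights events API and should be checked against the current documentation. The subscription ID is a placeholder, and build_events_url is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import quote

# Assumed management endpoint and API version; verify before use.
ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2015-04-01"


def build_events_url(subscription_id, resource_provider, start, end):
    """Build a URL that lists management events for one resource provider."""
    # OData filter mirroring -StartTime, -EndTime and -ResourceProvider.
    odata_filter = (
        f"eventTimestamp ge '{start.isoformat()}' "
        f"and eventTimestamp le '{end.isoformat()}' "
        f"and resourceProvider eq '{resource_provider}'"
    )
    return (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        "/providers/microsoft.insights/eventtypes/management/values"
        f"?api-version={API_VERSION}&$filter={quote(odata_filter)}"
    )


url = build_events_url(
    "00000000-0000-0000-0000-000000000000",  # placeholder subscription ID
    "Microsoft.Compute",
    datetime(2016, 1, 1, tzinfo=timezone.utc),
    datetime(2016, 1, 5, tzinfo=timezone.utc),
)
print(url)
```

A real request would also need an Authorization header carrying an Azure Active Directory bearer token, which is outside the scope of this sketch.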
Some Azure resources support the collection of performance metrics and the creation of metric alert rules that fire when certain criteria are met, such as CPU % greater than 90% for 5 minutes.
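To make the rule criteria concrete, here is a minimal Python sketch of how a condition such as "CPU % greater than 90% for 5 minutes" could be evaluated. It assumes one sample per minute and average aggregation over the window. It illustrates the idea only and is not how Azure itself evaluates rules; rule_fires and the sample values are hypothetical.

```python
from statistics import mean


def rule_fires(samples, threshold=90.0, window=5):
    """Return True if the average of the last `window` samples exceeds threshold."""
    if len(samples) < window:
        return False  # not enough data to cover the whole window
    return mean(samples[-window:]) > threshold


# One CPU % sample per minute (made-up values for illustration).
cpu = [72.0, 88.0, 93.0, 95.0, 97.0, 94.0, 96.0]
print(rule_fires(cpu))
```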
Metric alert rules can be configured to send alert emails to specified Azure administrators and email addresses. They can also be configured to invoke webhooks, which can trigger additional notifications, for example to pagers or messaging services.
Stephen Siciliano has documented how to configure metric alerts in the Production Portal. Rob Boucher has documented how to configure alert webhooks in the Production Portal. It is also possible to use the Azure Resource Manager API to create metric alerts, similar to the creation of Azure Service Health alerts.
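A webhook receiver typically parses the JSON body of the alert and forwards a short message to a pager or messaging service. The Python sketch below shows only the parsing step. The payload shape used here (status, context, condition and their field names) is an assumption about the metric alert webhook format and should be confirmed against the webhook documentation; summarize_alert and the sample values are hypothetical.

```python
import json


def summarize_alert(body):
    """Turn a webhook request body (JSON text) into a one-line message."""
    payload = json.loads(body)
    context = payload.get("context", {})
    condition = context.get("condition", {})
    # Missing fields fall back to neutral text rather than raising.
    return (
        f"[{payload.get('status', 'Unknown')}] "
        f"{context.get('name', 'unnamed rule')} on "
        f"{context.get('resourceName', 'unknown resource')}: "
        f"{condition.get('metricName', 'metric')} "
        f"{condition.get('operator', '?')} {condition.get('threshold', '?')}"
    )


# Assumed payload shape with made-up values.
sample = json.dumps({
    "status": "Activated",
    "context": {
        "name": "HighCpu",
        "resourceName": "vm1",
        "condition": {
            "metricName": "Percentage CPU",
            "operator": "GreaterThan",
            "threshold": "90",
        },
    },
})
print(summarize_alert(sample))
```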
Azure Insights is an Azure Resource Manager API supporting the retrieval of information about Azure resources and the use of this information to alter the behavior of the resources. Among other features it provides support for:
- Management of Azure service health alerts
- Management of metric alerts for a resource
- Retrieval of metrics
- Retrieval of Azure Service Events
- Retrieval of Audit Logs
- Retrieval of usage quotas
Azure Insights can be used only with Azure Resource Manager (ARM) resources; it provides no support for classic Azure Service Management (ASM) services. Not all ARM resources currently support Azure Insights.
With Virtual Machines, the Azure Insights REST API can be used to:
- Configure the collection of metric information
- Create alerts based on metric values
- Configure autoscaling
The collection of metric information from a VM requires the deployment of a VM extension:
- Name: Microsoft.Insights.VMDiagnosticsSettings
- Type: Microsoft.Azure.Diagnostics.IaaSDiagnostics
This can be done in the Production Portal or using the Set-AzureRmVMDiagnosticsExtension PowerShell cmdlet. The information captured, including performance counters, is specified in a configuration file that is generated automatically when the portal is used or uploaded when PowerShell is used. Once captured, the diagnostic information is persisted into a specified Azure Storage account. The schema for the configuration is documented here. Note that the use of a Metrics element is essential for the information to be accessible using Azure Insights.
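As an illustration of what such a Metrics element can look like, the following Python sketch generates one with the standard library. The element and attribute names (Metrics, resourceId, MetricAggregation, scheduledTransferPeriod) and the PT1H and PT1M transfer periods are assumptions based on the diagnostics configuration schema and should be checked against it. The resource ID is a placeholder and build_metrics_element is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_metrics_element(vm_resource_id):
    """Build a Metrics element with hourly and per-minute aggregation."""
    metrics = ET.Element("Metrics", {"resourceId": vm_resource_id})
    for period in ("PT1H", "PT1M"):  # ISO 8601 durations: 1 hour, 1 minute
        ET.SubElement(
            metrics, "MetricAggregation", {"scheduledTransferPeriod": period}
        )
    return metrics


# Placeholder resource ID; substitute the ARM resource ID of the VM.
vm_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg/providers/Microsoft.Compute"
    "/virtualMachines/example-vm"
)
print(ET.tostring(build_metrics_element(vm_id), encoding="unicode"))
```

The generated fragment would sit inside the diagnostics configuration alongside the performance counter definitions.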
Once configured, the metric information is available both on the Production Portal and through the Azure Insights API where it can be used for monitoring, autoscaling and the generation of performance-based alerts (such as high CPU % for 10 minutes).
Log Analytics is the monitoring feature of the Operations Management Suite (OMS). Among other features, it provides a way to ingest logs from IaaS Virtual Machines and PaaS Cloud Services and make these logs searchable in the Log Analytics Portal. Note that Log Analytics was recently renamed from Operational Insights.
Log Analytics supports two ingestion techniques: agent-based or Azure Storage account based. Agent-based ingestion uses the Microsoft Monitoring Agent deployed onto the VM to transfer data directly into Log Analytics. Azure Storage account ingestion uses a standard PaaS or IaaS Diagnostics extension to capture and persist monitoring data to an Azure Storage account from which Log Analytics can then pull it into its datastore.
The types of data supported vary between agent and storage-based ingestion.
Agent-based ingestion supports the following types of data:
- Windows Event logs
- Windows Performance counters
- Linux Performance counters
- IIS logs
- Syslog (Unix)
Storage-based ingestion supports the following types of data (specifically not including performance counters):
- Windows Event logs
- IIS logs
- Syslog (Unix)
- ETW Logs
- Service Fabric Events
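The two lists above can be captured in a small lookup that answers which ingestion technique can carry a given type of data. This Python sketch simply restates the lists; the function name and the "agent" and "storage" labels are chosen here for illustration.

```python
# Data types per ingestion technique, as listed above.
AGENT_TYPES = {
    "Windows Event logs",
    "Windows Performance counters",
    "Linux Performance counters",
    "IIS logs",
    "Syslog",
}
STORAGE_TYPES = {
    "Windows Event logs",
    "IIS logs",
    "Syslog",
    "ETW logs",
    "Service Fabric events",
}


def ingestion_paths(data_type):
    """Return the ingestion techniques that support the given data type."""
    paths = []
    if data_type in AGENT_TYPES:
        paths.append("agent")
    if data_type in STORAGE_TYPES:
        paths.append("storage")
    return paths


for name in ("IIS logs", "Windows Performance counters", "ETW logs"):
    print(name, "->", ingestion_paths(name))
```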
The Production Portal can be used to configure a Log Analytics workspace and configure VMs to be monitored and storage accounts to be used for data ingestion. Other elements of the configuration must be performed using the Log Analytics Preview Portal – such as the configuration of which data is to be ingested through the agent (Overview/Settings/Data tab). The ingested logs can be queried using the Log Analytics Preview Portal.
The Log Analytics Preview Portal provides 32-bit and 64-bit versions of the Microsoft Monitoring Agent (Overview/Settings/Connected Sources tab). The agent can be deployed anywhere and is then configured with Log Analytics workspace credentials so that it can connect securely. ARM Templates (Windows, Ubuntu) can be used to deploy and configure the Microsoft Monitoring Agent on Windows or Ubuntu VMs.
Alternatively, the IaaS Diagnostics extension can be deployed onto the VM and configured to persist monitoring information to an Azure Storage account configured as a source for Log Analytics ingestion. For PaaS cloud services, the PaaS Diagnostics Extension can be deployed into the cloud service roles and then configured to persist monitoring information to an Azure Storage account configured as a source for Log Analytics ingestion.
Azure Security Center
Azure Security Center is a preview service available in the Production Portal that provides for the creation of security policies and the monitoring of various Azure resources to identify non-compliance with the policies. It also provides information on mitigating non-compliance. Currently, Azure Security Center supports the monitoring of Virtual Machines (v1 and v2), Endpoints, Network Security Groups, Web Application Firewall and Azure SQL Database.
For Virtual Machines, Azure Security Center works by installing several monitoring extensions (Microsoft.EnterpriseCloud.Monitoring/MicrosoftMonitoring and Microsoft.Azure.Security/Monitoring) onto VMs as well as mitigation agents (Microsoft.Azure.Security/IaaSAntimalware).
The following policy areas can be configured currently:
- System updates
- Baseline rules
- Access Control List on endpoints
- Network Security Groups
- Web Application Firewall
- SQL Auditing
- SQL Transparent Data Encryption
System updates, baseline rules and antimalware support require that data collection be configured for VMs.