How I do it: SCOM alert handling/management
This week I had a discussion with some colleagues about SCOM alert handling/management. When do I close an alert? How do I do that? What about the monitors causing the alert? How can we automate this stuff? What about roll-up monitors? That is a lot of questions, and I had never taken the time to think them through. This time I did, and the result is this post. It is mostly a reference for myself, but I hope it gives other people some guidance as well.
SCOM alert handling scenarios
What kind of alerts do we have to handle in SCOM? What are good practices and what are bad practices?
I came up with five different alert handling scenarios (see the attached PDF for better readability).
Automating the scenarios
In an ideal world:
- We only get actionable, meaningful alerts -> every alert makes perfect sense
- every monitor alert gets automatically resolved once the underlying root cause has been fixed
- no SCOM Operator will ever close a monitor alert without ensuring that the causing monitor is healthy
TO BE VERY CLEAR:
IF YOU CLOSE AN ALERT CREATED BY A UNIT MONITOR THIS MONITOR WILL NOT SWITCH BACK TO A HEALTHY STATE. AS LONG AS THE ROOT CAUSE HAS NOT BEEN FIXED THE MONITOR STAYS UNHEALTHY AND WILL NOT TRIGGER ANY NEW ALERT!!!
But unfortunately we don’t live in an ideal world, so I created a little script (you can download it from TechNet Gallery here) to help me with alert handling and management according to Microsoft's good practices.
This script handles alerts:
- It closes all open SCOM alerts (ResolutionState <> 255) where the TimeRaised property is older than a certain age.
-> This will clean up my console if the Operators have forgotten to close alerts in time
- It closes all new SCOM alerts (ResolutionState = 0) where the TimeRaised property is older than the configured SCOM alert grooming age (default 7 days)
-> This will clean up my console if the Operators have forgotten to acknowledge or otherwise handle open alerts
- It closes alerts created by specific workflows after a configurable amount of time, using an XML file with custom alert handling rules
-> This allows you to configure exceptions to the rules above. E.g. some alerts created by the AD MP rules are not relevant after X hours. So you can close the alert safely because it will be triggered again if the issue still exists.
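The three closing rules above can be sketched as a single decision function. The script itself is PowerShell and works against live SCOM alerts; the block below is only a Python model of the decision logic. The `Alert` record, the function names, the 30-day threshold for open alerts, the XML layout (`<Rule workflow="..." maxAgeHours="..."/>`) and the order in which the rules are checked are all assumptions made for illustration, not the script's actual parameters or file format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional
import xml.etree.ElementTree as ET

NEW = 0       # SCOM resolution state "New"
CLOSED = 255  # SCOM resolution state "Closed"


@dataclass
class Alert:
    """Hypothetical stand-in for a SCOM alert; only the fields the rules need."""
    workflow: str            # name of the rule or monitor that raised the alert
    resolution_state: int
    time_raised: datetime


def load_custom_rules(xml_text: str) -> Dict[str, timedelta]:
    """Parse the assumed rule format into {workflow name: maximum alert age}."""
    root = ET.fromstring(xml_text)
    return {
        rule.attrib["workflow"]: timedelta(hours=float(rule.attrib["maxAgeHours"]))
        for rule in root.findall("Rule")
    }


def close_reason(alert: Alert, now: datetime, custom_rules: Dict[str, timedelta],
                 max_new_age: timedelta = timedelta(days=7),
                 max_open_age: timedelta = timedelta(days=30)) -> Optional[str]:
    """Return why the alert should be closed, or None to leave it alone."""
    if alert.resolution_state == CLOSED:
        return None  # already closed, nothing to do
    age = now - alert.time_raised
    # Assumption: a per-workflow exception is checked first; if it does not
    # fire, the general age rules still apply as a backstop.
    limit = custom_rules.get(alert.workflow)
    if limit is not None and age > limit:
        return "custom rule"
    # New alerts nobody acknowledged within the grooming age.
    if alert.resolution_state == NEW and age > max_new_age:
        return "unhandled new alert"
    # Any alert that is still open after the general maximum age.
    if age > max_open_age:
        return "stale open alert"
    return None
```

Calling `close_reason` for every alert returned by a query and closing those with a non-`None` result gives the behaviour described above; the returned reason could also serve as the comment written to the closed alert.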
But the script also takes care of the causing monitors and their health state:
- If the script closes a monitor alert, it automatically resets the causing monitor if necessary.
-> This will ensure that if the root cause has not been fixed the monitor will trigger a new alert during the next execution cycle.
- It analyzes the monitor alerts closed during the last 24 hours. If a closed monitor alert is not tagged by the script, then an Operator has probably closed it manually. The script checks whether the causing monitor has been reset. If not, the script resets the monitor.
-> This handles the common mistake that Operators will close alerts without taking care of the causing monitor.
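The check for manually closed monitor alerts can be sketched in the same way. Again this is a Python model of the logic, not the script: `Monitor`, `ClosedAlert`, the `closed_by_script` flag (standing in for whatever tag the script writes on alerts it closes) and the use of the resolution time for the 24-hour window are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Monitor:
    """Hypothetical stand-in for a unit monitor and its current health."""
    name: str
    healthy: bool


@dataclass
class ClosedAlert:
    """Hypothetical stand-in for an alert that is already closed."""
    monitor_name: str        # workflow that raised the alert
    is_monitor_alert: bool   # False for alerts raised by rules
    time_resolved: datetime
    closed_by_script: bool   # assumed tag set when the script closed the alert


def monitors_to_reset(closed_alerts: List[ClosedAlert],
                      monitors: Dict[str, Monitor],
                      now: datetime,
                      lookback: timedelta = timedelta(hours=24)) -> List[str]:
    """Return the monitors behind manually closed alerts that are still unhealthy."""
    to_reset: List[str] = []
    for alert in closed_alerts:
        if not alert.is_monitor_alert:
            continue  # rule alerts have no monitor state to reset
        if alert.closed_by_script:
            continue  # the script already handled the monitor when it closed the alert
        if now - alert.time_resolved > lookback:
            continue  # outside the analysis window
        monitor = monitors.get(alert.monitor_name)
        if monitor is not None and not monitor.healthy and monitor.name not in to_reset:
            to_reset.append(monitor.name)
    return to_reset
```

Only unhealthy monitors end up in the result, which matches the "if necessary" condition: a monitor that has already returned to healthy needs no reset.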
The script is a working proof of concept. Requirements are:
- PowerShell 2.0
- SCOM 2012 PS module (e.g. SCOM 2012+ Console installed)
- SCOM 2012 agent installed
The reason for the SCOM agent requirement is that you can easily execute this script in a SCOM workflow context (e.g. a timed rule). But you can also simply execute it in a shell. It includes detailed comment-based help which explains all parameters.
The script writes detailed log events to the Operations Manager event log (hence the requirement for the SCOM 2012 agent).
What do you think? Do you have more SCOM alert scenarios that should be handled? Please let me know! Feedback is, as always, highly appreciated.