The Challenges with Manual Reset Monitors
Recently some of my peers were discussing issues that their customers are encountering with manual reset monitors. I thought the main points of the discussion were worth sharing.
What is a Manual Reset Monitor?
It is a type of monitor that detects an error condition, turns “red”, and generates an alert. What it does not do is detect the corresponding “success” condition, so someone must manually go into the OpsMgr console and reset the monitor from “red” back to “green” (healthy). A more useful type of monitor goes “red” when an error occurs and returns to “green” once the error is no longer occurring, which is what most well-written monitors do.
What challenges do they present?
For administrators who rarely use the OpsMgr Console:
- Many administrators do not live in the OpsMgr console. Often they forward OpsMgr alerts through a product connector to a separate ticketing system. In this scenario, a manual reset monitor detects an error condition, turns red, and generates an alert. The alert is sent through the product connector to the ticketing system, which creates a ticket for the appropriate administrator (an Exchange admin, for example). The Exchange admin may not even have access to the OpsMgr console; all they have is the ticket, which contains the data from the OpsMgr alert. They use that information to resolve the problem and close the ticket. The product connector then closes the alert on the OpsMgr side, but in OpsMgr 2007 R2 closing the alert does not reset the manual reset monitor, so the monitor never goes back to green. When the problem recurs the next day, the monitor is still red, so no new alert is generated and no ticket is created.
For administrators who do use the OpsMgr Console:
- Administrators who do live in the console still must know that an alert came from a manual reset monitor in order to reset it. Many won't realize this; they solve the problem and close the alert without resetting the monitor. If the problem recurs the next day, they don't get alerted.
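To make the missed re-alert concrete, here is a minimal sketch in Python. It is a toy model, not OpsMgr code: the class `ManualResetMonitor` and its methods are hypothetical names, and it assumes only what is described above, namely that an alert is generated on the change from green to red and that closing the alert leaves the monitor state untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ManualResetMonitor:
    """Toy model (hypothetical): alerts only on a green -> red state change."""
    healthy: bool = True
    alerts: list = field(default_factory=list)

    def observe_error(self, description):
        # An alert is generated only when the state changes to red.
        if self.healthy:
            self.healthy = False
            self.alerts.append(description)
            return True
        # Already red: no state change, so no new alert.
        return False

    def close_alert(self):
        # Closing the alert (in the console or through a connector)
        # does not change the monitor state in this model.
        pass

    def reset(self):
        # The manual step somebody has to remember to perform.
        self.healthy = True


monitor = ManualResetMonitor()
day1 = monitor.observe_error("day 1: error detected")  # alert generated
monitor.close_alert()                                  # ticket closed, monitor still red
day2 = monitor.observe_error("day 2: error recurs")    # no alert generated
monitor.reset()                                        # someone finally resets the monitor
day3 = monitor.observe_error("day 3: error recurs")    # alert generated again

print(day1, day2, day3, len(monitor.alerts))
```

In this model the day 2 recurrence is silently swallowed because nothing ever called `reset()` after the alert was closed.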
What are an MP author’s alternatives?
The top priority in a monitoring product is generally to notify the appropriate administrator when an issue occurs. Health state (and therefore monitors) adds real value, but it should generally not come at the expense of failing to alert on an error condition or on the recurrence of an error condition. There are several alternatives available to an MP author that require little additional MP development time.
- Use auto-resolving or timer reset monitors instead of manual reset monitors. This ensures that the administrator will eventually get re-alerted. If this isn’t possible, you can use one of the three other solutions below.
- Modify the manual reset monitors so they don't alert, and create alert-generating rules (with suppression), targeted at the same class, that look at the same criteria. You keep a health model, and the administrator can still reset the health state when they choose to. The administrator also gets re-alerted if the problem recurs. If you must have manual reset monitors, this is the best of both worlds: alerting happens on the recurrence of an issue, and you still maintain the health model that monitors provide.
- Create both rules and manual reset monitors for the error conditions that require them. Disable one or the other out of the box and document this. This way the administrator only needs to create overrides to use either one.
- Replace the manual reset monitors with rules. You don't get the health model but you can ensure that your MP alerts the administrator of a potential problem.
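The second alternative above (a non-alerting manual reset monitor paired with an alert-generating rule that uses suppression) can be sketched as follows. This is a simplified Python model, not MP authoring code: `SilentMonitor`, `SuppressedAlertRule`, and the suppression key are hypothetical, and suppression is assumed to mean that a repeat occurrence increments a counter on the matching open alert instead of creating a new one.

```python
class SilentMonitor:
    """Manual reset monitor with alerting turned off (hypothetical model)."""

    def __init__(self):
        self.healthy = True

    def observe_error(self):
        self.healthy = False  # health model still turns red

    def reset(self):
        self.healthy = True


class SuppressedAlertRule:
    """Alert rule with suppression (hypothetical model)."""

    def __init__(self):
        self.open_alerts = {}    # suppression key -> repeat count
        self.alerts_created = 0

    def observe_error(self, key):
        if key in self.open_alerts:
            # A matching alert is still open: bump its repeat count.
            self.open_alerts[key] += 1
            return False
        # No open alert for this key: create a new one.
        self.open_alerts[key] = 0
        self.alerts_created += 1
        return True

    def close_alert(self, key):
        self.open_alerts.pop(key, None)


monitor = SilentMonitor()
rule = SuppressedAlertRule()
key = "server01/disk-error"  # hypothetical suppression key


def error_detected():
    # Monitor and rule look at the same criteria on the same target.
    monitor.observe_error()
    return rule.observe_error(key)


first = error_detected()       # new alert
repeat = error_detected()      # suppressed, repeat count goes up
rule.close_alert(key)          # admin fixes the issue and closes the alert
recurrence = error_detected()  # new alert, even though the monitor is still red

print(first, repeat, recurrence, rule.alerts_created, monitor.healthy)
```

The point of the design is that alerting no longer depends on anyone resetting the monitor: the rule re-alerts on recurrence while the monitor continues to carry the health state.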
What realistic workarounds are available after an MP has been developed which contains manual reset monitors?
- You can disable the manual reset monitors with overrides and create your own MP that contains rules to watch for the same error conditions.
- Use a tool, like GreenMachine from Tim Helton, to reset the health of monitors on a regular basis. The downside is that you may get re-alerted on issues you are already working on, and the overhead and time required to reset monitors in bulk can be significant. http://blogs.technet.com/b/timhe/archive/2009/01/15/announcing-the-greenmachine-utility-for-operationsmanager-rtm-sp1-and-r2.aspx
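The trade-off of scheduled resets can be illustrated with a small simulation. This is a hedged sketch in Python and says nothing about how GreenMachine itself works: `count_alerts`, the hourly granularity, and the 24-hour reset interval are all assumptions made for the example.

```python
def count_alerts(error_hours, reset_every, total_hours):
    """Count alerts from one manual reset monitor (hypothetical model).

    error_hours: set of hours in which the error condition is detected.
    reset_every: reset the monitor every N hours; 0 means never reset.
    """
    healthy = True
    alerts = 0
    for hour in range(total_hours):
        if reset_every and hour > 0 and hour % reset_every == 0:
            healthy = True  # scheduled bulk reset
        if hour in error_hours and healthy:
            healthy = False  # green -> red generates an alert
            alerts += 1
    return alerts


# Case 1: the issue is fixed after hour 1 and recurs at hour 30.
recurring = {1, 30}
missed = count_alerts(recurring, reset_every=0, total_hours=48)
caught = count_alerts(recurring, reset_every=24, total_hours=48)

# Case 2: one ongoing issue, detected every hour for two days.
ongoing = set(range(48))
single = count_alerts(ongoing, reset_every=0, total_hours=48)
duplicated = count_alerts(ongoing, reset_every=24, total_hours=48)

print(missed, caught, single, duplicated)
```

With the daily reset, the recurrence in case 1 produces a second alert that would otherwise be missed, but the ongoing issue in case 2 also produces a second alert while someone is presumably still working on it.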