Understanding Monitors in OpsMgr 2007 Part I – Unit Monitors
At MMS 2009 I presented a topic on understanding monitors in OpsMgr 2007. According to the feedback, that session was well received so I thought I would convert some of it to a series of blog entries.
There are two key ways of delivering monitoring in OpsMgr 2007 – rules and monitors. At first glance, rules appear to deliver much the same monitoring as monitors.
There are some similarities for sure but rules and monitors are actually very different things. There are two major things to understand about rules compared to monitors. Rules have zero impact on measuring the health of the object being monitored. In addition, rules can collect data and monitors don’t. An example is instructive. If you have a need to both collect performance data and also have the measurement of the same performance data impact the total health of the monitored object, you need both a rule and a monitor. Why? Again, monitors don’t collect anything – they just evaluate the data live and reflect back what is found in terms of health state changes. Rules collect data but have no impact on health state. So, in this example scenario, you need both.
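The split between rules and monitors can be sketched in a few lines of Python. This is only a conceptual model, not OpsMgr code: the class names, the 90 percent threshold and the sample values are all made up for illustration. The rule keeps every sample and has no notion of health, while the monitor keeps nothing except the health state produced by the latest sample.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRule:
    """Hypothetical stand-in for a rule: stores samples, never touches health."""
    samples: list = field(default_factory=list)

    def on_sample(self, value: float) -> None:
        self.samples.append(value)


@dataclass
class ThresholdMonitor:
    """Hypothetical stand-in for a monitor: evaluates live, stores only state."""
    threshold: float
    state: str = "Healthy"

    def on_sample(self, value: float) -> None:
        self.state = "Critical" if value > self.threshold else "Healthy"


rule = CollectionRule()
monitor = ThresholdMonitor(threshold=90.0)

# The same performance samples are fed to both, which is why the
# scenario in the text needs a rule AND a monitor.
for cpu_percent in (35.0, 97.5, 42.0):
    rule.on_sample(cpu_percent)
    monitor.on_sample(cpu_percent)

print(rule.samples)   # the collected history lives in the rule
print(monitor.state)  # the current health lives in the monitor
```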
Notice that in the preceding paragraph I made reference to monitors and rules being associated with an object. What I am really talking about here is the class (aka object) against which a rule or monitor is targeted. Understanding targeting is pivotal to understanding OpsMgr 2007. If targeting is a confusing topic for you or you want to refresh yourself on proper targeting techniques, take a look at the article I published in TechNet Magazine discussing this topic in detail - http://technet.microsoft.com/en-us/magazine/2008.11.targeting.aspx?pr=blog
Enough about rules – the topic at hand is to discuss monitors! Monitors are where you see the power of OpsMgr 2007 and, from the list above, you can see that there is substantially more flexibility when using monitors vs. rules. Remember, monitors are all about health and that is the goal of OpsMgr 2007. To restate, monitors watch whatever they are monitoring – performance data, WMI, event logs, whatever – and tell administrators about the results of monitoring by changing the health state of the object being monitored. One point here – on the rules node you see a category for alert generating rules. Monitors can definitely generate an alert as well so don’t think you are missing out on that ability by choosing a monitor!
There are three categories of monitors – unit monitors, aggregate monitors and dependency monitors.
Unit monitors can be thought of as the workhorse of monitoring; they drive health detection in OpsMgr. Without unit monitors you would never know a problem exists! The best way to get to know unit monitors is to work with them. A caution here – make sure you do your testing in a lab environment, because any changes made in the production environment take effect right away. Building multiple monitors for testing could cause notable churn in production – unexpected churn is far easier to absorb in a test lab!
There are lots of options for unit monitors – ranging from very straightforward to very complex. Discussing each and every unit monitor is beyond the scope of this blog entry but there are a couple that are particularly interesting.
Simple Event Detection – Detecting a simple event is easy with most monitoring solutions – including OpsMgr. The Simple Event Detection monitor is, well, simple. I describe it here as a starting point and because it will provide some good discussion on building monitors in general that is applicable to any monitor.
From the create monitoring wizard select to create a Simple Event Detection monitor. For our example we will use a Windows Event Reset as the type – more about that in a minute. Make sure you choose a management pack other than the default management pack to store this monitor!
On the general properties screen, choose a target and parent monitor. For our example let’s assume we will be delivering additional monitoring to SQL 2005 Servers. If the SQL 2005 Management Pack is installed we will have a target called SQL 2005 DB Engine. Select that target. The next choice is which Parent Monitor should ‘contain’ our unit monitor. For monitors, there are 4 general parent monitors that may be an option for you – availability, configuration, performance and security. You may also see one for Backwards Compatibility but that isn’t a category that you should use when authoring a monitor yourself. These categories allow grouping of unit monitors according to the general intended purpose. If, for example, our monitor will be looking for events that could impact the general availability of the monitored object, it should be placed under availability. If our monitor will be looking for events that could impact the general performance of the monitored object, it should be placed under performance. We will discuss these categories in greater detail in part II of this post because each category is actually itself a monitor – an aggregate monitor.
We are building a simple event detection monitor so the next screen will ask for the event log OpsMgr should look in for the event. We will leave the default of application but note that this could be any event log in place on the system.
So we’ve chosen the application log, now we have to specify the event to look for with our monitor.
Click next and – we are being asked again for the event log we want to use? What’s going on here? This is an excellent opportunity to discuss the option we selected when we first started building our monitor. We chose a Simple Event Detection monitor and we chose that it should be a Windows event reset monitor. Remember that monitors are all about health. Monitors detect when a healthy condition goes unhealthy and can ALSO detect when an unhealthy condition goes back to healthy. That is, in fact, the holy grail of monitors – to detect when an unhealthy situation takes place and then automatically detect when the unhealthy condition goes away! And that is exactly what is being done here. One event, the first one with Event ID 1234, will indicate an unhealthy condition has taken place and now this second event – in the same or a different event log – will indicate that the unhealthy condition has been resolved – completely automatic!
Of course, not all monitoring scenarios lend themselves to that, which is why in addition to Windows event reset we also have options for timer reset and manual reset. Timer reset is a situation where we detect the unhealthy condition and immediately start a timer. If the unhealthy condition has not been detected again during our timing period (defined on the monitor) then we revert the health back to a healthy state. If we detect the unhealthy condition again during the timing period, the timing period starts over. The manual reset monitor means that once the unhealthy condition is noted it will not be reset until it is either reset by hand or reset by some scripting method. The manual reset monitor should not be in wide use and, when used, should be for very specific scenarios.
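The Windows event reset behavior described above boils down to a two-state machine driven by two event IDs. Here is a minimal Python sketch of that idea. It is a conceptual model only, not how OpsMgr implements it: the class name is invented, and while Event ID 1234 is the unhealthy event from our example, the reset ID 1235 is purely a placeholder I picked for illustration.

```python
class EventResetMonitor:
    """Hypothetical model of a Windows event reset unit monitor."""

    def __init__(self, unhealthy_event_id, healthy_event_id,
                 unhealthy_state="Warning"):
        self.unhealthy_event_id = unhealthy_event_id
        self.healthy_event_id = healthy_event_id
        self.unhealthy_state = unhealthy_state  # "Warning" or "Critical"
        self.state = "Healthy"

    def on_event(self, event_id):
        """Update health for one incoming event and return the new state."""
        if event_id == self.unhealthy_event_id:
            self.state = self.unhealthy_state
        elif event_id == self.healthy_event_id:
            self.state = "Healthy"
        # Any other event ID is ignored and the state is left alone.
        return self.state


# 1234 comes from the example in the text; 1235 is an assumed reset ID.
sql_monitor = EventResetMonitor(unhealthy_event_id=1234, healthy_event_id=1235)
```

A manual reset monitor would be the same model with no healthy event at all: once the state flips it stays flipped until somebody sets it back.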
We will again select the Application event log.
And select the event criteria that will indicate monitoring is again healthy.
Next we pull our two event IDs together and select whether the first event will raise a warning or a critical health state (there is a drop down that will allow you to select warning or critical but you can’t see it until you click on warning to change it if needed). The second event will return us to a healthy status.
The final screen of the wizard allows selection of whether or not this monitor will generate an alert.
Select to create and the monitor is saved. If we go find our new monitor and select properties on it we see that there are actually additional items we can configure, such as product knowledge, diagnostic and recovery and overrides that are not part of the initial wizard. Product knowledge allows information about the monitor and how to resolve detected problems to be recorded. Diagnostic and Recovery allows specific steps to be configured as a response to the monitor changing state that may aid in diagnosing or fixing the problem and overrides are where you specify any conditions other than the default that should be in place for all monitored objects or a subset thereof. It is only possible to override values that have been authored to allow overriding.
We’ve gone screen by screen for our first example to illustrate a few important key concepts that will generally apply to all monitors. For our next examples of interesting unit monitors we won’t go screen by screen but only show the relevant screens.
WMI Performance Monitors
In some cases it would be helpful to have a performance monitor style method of collecting information about an object but there is no performance counter available. An example of this might be file size. Suppose you have a particular log file that needs to be monitored for an increase in size. You look in the performance monitor counters and there is not a counter available. What option do you have? Certainly a script would be a workable choice but you may not be comfortable scripting. Is there a way to take file size information and convert it to performance data that can be used in OpsMgr? Absolutely – and there are lots of other examples too beyond the file size example. I have previously described exactly how to set up such a scenario here so I will avoid repeating it, but do take a look at this example as having this ability really is powerful.
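To make the idea of turning a file size into performance-style data a little more concrete, here is a rough Python sketch. To be clear, this is not the WMI-based setup from the article referenced above and it does not touch OpsMgr at all: it reads the size straight from the file system as a stand-in, and the counter name FileSizeBytes, the function names and the threshold are all assumptions of mine.

```python
import os
import time


def sample_file_size(path):
    """Return one performance-style sample: (timestamp, counter, bytes)."""
    return (time.time(), "FileSizeBytes", os.path.getsize(path))


def exceeds_threshold(sample, threshold_bytes):
    """True when the sampled size is above the (hypothetical) threshold."""
    _timestamp, _counter, value = sample
    return value > threshold_bytes
```

Once the size is available as a number sampled on a schedule, it can be treated like any other performance value and compared against a threshold.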
Log File Monitoring
Another monitor type that is little used but very useful is the log file monitor. Many applications have log files they write to indicate application processing, error conditions, etc. SQL has its error log, SCCM/SMS is chock full of logged information, and there are lots of other examples. Over time there will be conditions in your environment that you know to be problems that arise when specific log entries show up. In many cases the provided SQL and/or SCCM management packs will handle the errors from those systems for you but in cases where they don’t, being able to craft your own log monitor is useful. And it’s not difficult. Here are the relevant configuration screens.
On the Application Log Data Source screen we need to configure the directory where our log(s) are stored and then a pattern that will specify how to search. Note that wildcards are supported so it would be possible to search multiple log files of similar name.
Next, we need to configure what information in the log file we want to detect. In this case we are looking for the text bummer – but it could also be a string of text rather than a single word. Also, the parameter name will be the same regardless – this is the syntax to specify we are defining the first parameter of interest which is all I’ve ever needed.
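The directory, wildcard pattern and search text from the two screens above fit together roughly like this. The Python below is only an illustration of the matching idea under my own assumptions: the function name is invented, it rescans whole files rather than watching for newly written lines the way a real log monitor would, and it uses a plain case-sensitive substring match.

```python
from pathlib import Path


def find_matching_lines(directory, pattern, text):
    """Return (file name, line) pairs for every line containing `text`.

    `pattern` may contain wildcards such as app*.log, so several
    similarly named log files can be searched in one pass.
    """
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if text in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

A call such as find_matching_lines(log_dir, "app*.log", "bummer") would return one entry per offending line, where log_dir is whatever directory your application logs to.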
The next step is to configure how long we will let the alert remain before automatically resetting it. This configuration is a timer based monitor. I chose this option just to illustrate another example. Note that you could have just as easily set up an event reset monitor here – all that would be needed is to define another parameter that would flag the healthy condition. Also note that my timer reset is set to an hour. This means that if the problem condition is not detected again for an hour, the monitor will reset but, if the problem condition is detected again, the timer gets reset and another full hour will be required before considering the problem cleared up.
The rest of the screens are similar to what has already been seen.
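The timer reset logic is easy to get slightly wrong in your head, so here is a small Python sketch of it. Again this is just a model with invented names, not OpsMgr code. Time is passed in as plain seconds so the behavior is easy to follow, and I am assuming the monitor flips back to healthy at exactly the one hour mark.

```python
class TimerResetMonitor:
    """Hypothetical model of a timer reset monitor."""

    def __init__(self, reset_after_seconds=3600.0):
        self.reset_after = reset_after_seconds
        self.last_detected = None  # time of the most recent problem, if any

    def on_problem(self, now):
        """Record a detection; this restarts the full reset period."""
        self.last_detected = now

    def state(self, now):
        """Return the health state as of time `now` (in seconds)."""
        if self.last_detected is None:
            return "Healthy"
        if now - self.last_detected >= self.reset_after:
            return "Healthy"
        return "Warning"


log_monitor = TimerResetMonitor(reset_after_seconds=3600.0)
log_monitor.on_problem(now=0.0)     # first detection starts the hour
log_monitor.on_problem(now=3000.0)  # detected again, so the hour starts over
```

With those two detections the monitor stays in warning until 6600 seconds, not 3600, because the second detection restarted the clock.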
Repeated Event Detection
I’ve already described how to configure detection of a simple event. With such a configuration, when the configured event comes in it is detected and the configured action, such as raising an alert, takes place. There could be situations though where a single event may not indicate a problem. However, if 3 of the same events happen within a 15 minute window, that may indicate a problem that we need to investigate. Taking what we already know from the simple event monitor it’s easy to configure the repeat monitor. Just choose the repeated event monitor, configure what event we care about and then you will get to the screen shown below to configure our repeat settings.
There are a few choices for counting mode but the one I find simple and useful is Trigger on count, sliding. Basically this means that when the first event shows up we will start a timer for, in this case, 15 minutes. If enough additional events show up within that 15 minutes to reach the configured count then we will consider this a problem and move on to take action. You can configure the repeat count however you like by adjusting the compare count settings. There are other options here you can explore as you like.
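One way to picture a sliding count is as a window that trails behind the most recent event. The Python sketch below shows my reading of that idea using 3 events in 15 minutes. It is an approximation for illustration only: the class name is invented, and the exact rules OpsMgr applies (for example what happens to the count right after it triggers) may well differ.

```python
from collections import deque


class SlidingCountDetector:
    """Hypothetical model of counting repeated events in a sliding window."""

    def __init__(self, compare_count=3, window_seconds=900.0):
        self.compare_count = compare_count
        self.window = window_seconds
        self.timestamps = deque()

    def on_event(self, now):
        """Record one event at time `now`; return True if the count is met."""
        self.timestamps.append(now)
        # Drop events that have fallen out of the window ending at `now`.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.compare_count


detector = SlidingCountDetector(compare_count=3, window_seconds=900.0)
```

Feeding it events at 0, 600 and 1000 seconds does not trigger, because the first event has aged out by the time the third arrives. A fourth event at 1400 seconds does trigger, since the events at 600, 1000 and 1400 all sit inside one 15 minute window.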
Correlated Event Monitor
A more complex but very useful monitor is the correlated event detection monitor. Using this monitor it is possible to configure OpsMgr to watch for complex event patterns, whether in the same event log or different event logs, and alert only when the specified pattern occurs. For our example I’ve chosen a Windows event reset monitor, which means that OpsMgr will watch for a specific Windows event to trigger the reset of health after a problem occurs. I won’t go through configuring every event screen because it’s all similar to what we’ve already seen in the simple event discussion. Note on the screenshot below, however, that there were 3 events to configure. The first event is the reset event while events A and B are the correlated events. The screen below also shows options for how events A and B correlate to one another.
On the correlation screen we have a few options to discuss. First is the correlation interval. Like the repeated event detection, this interval specifies how long to watch for the event pattern after receiving the first event. If the pattern doesn’t manifest in this configured time then there will be no change to health. Also, there are multiple options for correlation settings as shown below. The wording here may be confusing at first but the graphic shown on the configuration page will change as you move from option to option to illustrate what each option does. Finally, note the Occurrence and Expression options. With these options you can increase the complexity of our filter by configuring how often our pattern should occur as well as specific expression information that is beyond what we are discussing here.
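As a rough picture of what a correlation interval does, here is a Python sketch of the simplest pattern I can think of: event A followed by event B within the interval. This is just one possible correlation, chosen by me for illustration, and it is not meant to mirror any particular option on the wizard screen. The class name and the event IDs 100 and 200 are made up.

```python
class CorrelatedEventDetector:
    """Hypothetical model: match event A followed by event B in time."""

    def __init__(self, event_a, event_b, interval_seconds):
        self.event_a = event_a
        self.event_b = event_b
        self.interval = interval_seconds
        self.first_a = None  # time of the pending A event, if any

    def on_event(self, event_id, now):
        """Feed one event; return True only when the A-then-B pattern matches."""
        if event_id == self.event_a:
            # Keep the first A unless its interval has already expired.
            if self.first_a is None or now - self.first_a > self.interval:
                self.first_a = now
            return False
        if event_id == self.event_b and self.first_a is not None:
            matched = now - self.first_a <= self.interval
            self.first_a = None  # the pending A is used up either way
            return matched
        return False


# 100 and 200 are placeholder event IDs; the interval is 5 minutes.
detector = CorrelatedEventDetector(event_a=100, event_b=200,
                                   interval_seconds=300.0)
```

A B event with no pending A is ignored, and a B event that arrives after the interval has run out produces no match, which corresponds to no change in health.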
Finally, we have the health screen that pulls this all together. Note that the event raised refers to the first event configured, which will trigger a healthy condition, whereas a match to the correlated event logic we just configured will result in a warning state (you can change it to critical if you like).
Part II of our discussion will turn to the aggregate rollup monitor and how it is used in OpsMgr.