Monitoring for Heartbeat Failures on non-Server Agents
Monitoring for agent heartbeat failures is a built-in function in OpsMgr. Heartbeats ensure that agents running in the environment are available and that each agent is able to interact with its management server. All agents, whether on a server or workstation class system, send heartbeats. An agent that is not sending heartbeats is shown in the OpsMgr console in a ‘grey’ state.
For server systems, an alert is generated in conjunction with the grey state to indicate the heartbeat failure.
An agent heartbeat failure is detected by the ‘Health Service Heartbeat Failure’ monitor. Note that this monitor is an aggregate rollup.
If we look at this monitor, specifically the override summary, we note something interesting – there is an override in place by default that will prevent any agents on Workstations from generating the failed heartbeat alert.
The failed heartbeat is still noted by the grey state, but no alert is generated. Why would this be the case? In general, workstation systems are expected to change state more often than server class systems: they are rebooted, shut down for periods of time, and so on. It is still important to know when a workstation is not heartbeating (grey state), but it would be annoying to continually receive heartbeat failure alerts.
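The behavior described so far can be summarized in a small sketch. This is not OpsMgr code; it only models the idea that a missed heartbeat always produces the grey state, while the alert depends on whether the system is treated as a server. The function names are hypothetical, and the 60-second interval and allowance of 3 missed heartbeats are illustrative assumptions, not values taken from this article.

```python
from datetime import datetime, timedelta

# Illustrative values only (assumptions); check the heartbeat settings
# in your own management group.
HEARTBEAT_INTERVAL = timedelta(seconds=60)
MISSED_HEARTBEATS_ALLOWED = 3


def is_heartbeat_failed(last_heartbeat, now,
                        interval=HEARTBEAT_INTERVAL,
                        missed_allowed=MISSED_HEARTBEATS_ALLOWED):
    """Return True once more than `missed_allowed` intervals pass with no heartbeat."""
    return now - last_heartbeat > interval * missed_allowed


def agent_state(last_heartbeat, now, is_server):
    """Model the default outcome: grey state for any agent, alert only for servers."""
    failed = is_heartbeat_failed(last_heartbeat, now)
    return {"grey": failed, "alert": failed and is_server}


now = datetime(2024, 1, 1, 12, 0, 0)
silent_for_five_minutes = now - timedelta(minutes=5)
print(agent_state(silent_for_five_minutes, now, is_server=True))
print(agent_state(silent_for_five_minutes, now, is_server=False))
```

With these assumed values, both agents go grey after five silent minutes, but only the server raises an alert.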
OK, that makes sense. But what if there are a few workstation class systems that will be treated as server systems, meaning that they are expected to be available routinely? Is there a way to enable the standard heartbeat failure alerts for those systems? Yes. To understand this, let’s first revisit the override that disables alerts on these systems. Note that the name of the override corresponds to a built-in group. If we look in the groups node and filter by the name shown in the override, we find our group of interest.
If we look at the membership of this group we notice that my test XP system is included by default.
The most important thing here: note the icon circled above. The object contained in this group is NOT a standard computer object. Heartbeat failure detection is the job of a special class of objects called ‘health service watcher’ objects. That’s what the object in the group is, as indicated by the glasses icon.
Our workstation system is in this group, so because of the group and the override we won’t get alerts when this workstation fails to heartbeat. So how can we change this? Simple: create your own group containing workstation objects that you DO expect to be available routinely and where you DO want to receive heartbeat failure alerts if there is a problem. In my test environment, I created a group with explicit membership containing my workstation object. My group is called ‘test XP Heartbeat Group’, as you will see in a bit. For now, the screenshots look very much the same.
The key here is to make sure the group you are creating contains ‘health service watcher’ objects rather than windows computer objects – if you end up with windows computer objects, the group will be meaningless for use with the heartbeat monitor.
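As a rough illustration of that check, the sketch below flags group members that are not watcher objects. It works on a plain list of (name, class) pairs that you would have to export yourself; the data shape, the member names, and the function name are hypothetical and do not come from any OpsMgr API.

```python
# Class display name as used in this article; treat the exact string as an assumption.
WATCHER_CLASS = "Health Service Watcher"


def non_watcher_members(members):
    """Return the names of group members whose class is not the watcher class."""
    return [name for name, cls in members if cls != WATCHER_CLASS]


# Hypothetical group contents: one correct member and one mistake.
group = [
    ("xp-test-01", "Health Service Watcher"),
    ("xp-test-02", "Windows Computer"),
]
print(non_watcher_members(group))
```

Any name this returns is a member the heartbeat monitor override would not act on, so it should be swapped for the matching watcher object.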
Now that we have our group created that we want to use to specify which workstation class systems should behave like servers in terms of heartbeat failure detection, we need to use that group to introduce another override to turn on alerting.
So, what we have now is the default override to disable alerting for all workstation class systems and then another override to turn it back on for my select group of workstation systems. In my test lab, both groups contain the same test system. In production, the default group will likely contain more systems than the custom group you create to enable alerting.
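The interaction of the two overrides can be sketched as a toy model. This is an assumption-laden illustration, not OpsMgr's actual override resolution: the `priority` field is a hypothetical stand-in for whatever precedence lets the custom override win, the system names are made up, and the default group is given a placeholder name because its real name is not stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    group: str             # name of the group the override targets
    generates_alert: bool  # value the override sets on the heartbeat monitor
    priority: int          # hypothetical ordering; higher wins in this sketch


def alert_enabled(watcher, group_members, overrides, default=True):
    """Return whether the heartbeat monitor would alert for `watcher` in this model."""
    applicable = [o for o in overrides
                  if watcher in group_members.get(o.group, set())]
    if not applicable:
        return default  # no override applies, so the monitor's own setting is used
    return max(applicable, key=lambda o: o.priority).generates_alert


# Placeholder group name for the built-in group; 'test XP Heartbeat Group' is from the article.
group_members = {
    "default workstation watcher group": {"xp-test", "xp-other"},
    "test XP Heartbeat Group": {"xp-test"},
}
overrides = [
    Override("default workstation watcher group", generates_alert=False, priority=0),
    Override("test XP Heartbeat Group", generates_alert=True, priority=1),
]

for name in ("xp-test", "xp-other", "server01"):
    print(name, alert_enabled(name, group_members, overrides))
```

In this model only the workstation in the custom group gets alerting turned back on; other workstations stay silent, and systems outside both groups keep the monitor's default.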
With all of this in place, I stop the ‘System Center Management’ service (formerly known as the OpsMgr Health Service in RTM and SP1) and wait for my agent to go into a grey state. Now when I inspect the alerts pane, I see a heartbeat failure alert, just as I expect.