All network devices not monitored

Question

Hey guys
This is my first time posting a question on this forum. I hope this is how this forum is supposed to be used.

My question:
I have some 580 freshly (yesterday) discovered network devices. For some reason all of them are labeled as "not monitored".
Now I'm not sure why that would be so. The network devices are all reachable with the provided SNMP credentials (obviously, otherwise they wouldn't even get discovered, right?).

The devices are all monitored by a single RP (default observer enabled) with two GWs as members. Searching for the cause of this problem, I came across this line on the "Monitoring networks by using Operations Manager" MS docs page:

"500 network devices (approximately 6,250 monitored ports) managed by a resource pool that has two or more gateway servers"

Could this be the cause? Is this scenario actually blocked? If so, how could I get past this?
Those devices (plus some 70 more) are in a network segment in which I have 2 GWs, no more. Could I simply create a second RP? I would like to have some redundancy, so I would prefer not to have two RPs with only one member each.

Any suggestions?

Regards,
Simon

Answer

Hi Simon,

this is really strange. I would assume that if you get all the network devices discovered, the monitoring should actually work.
I know that your firewall is OK, because we discussed this in your TechNet post, so I am wondering what the cause of this could be.
What I would do if I were you:

First, filter the Event Logs on your two Gateways (RP) and look for any related warnings or errors.
Second, empty the Health Service cache on those two servers.
Third, open the Health Explorer of a managed network device and check in the monitor hierarchy whether any monitors are initialized.

There is a limit of monitored network devices indeed, but I can tell you for sure that:

It is not a hard limit. It is only a number Microsoft obtained from testing, and it is influenced by many other factors.
Even if your GW RP is overwhelmed by the amount of data, you should have some of the network devices monitored or at least some of their ports.

So once the question of the ports is sorted out (see "SCOM network device ports"), I would dig into the logs and try to find a clue there.

You can still test the capacity limits stuff by deleting some of the network devices and leaving only a small number of devices (just as a test).

Regards,
Stoyan

Answer

Based on my research, the Microsoft limits are based on the number of monitored ports, with about 10 ports per device being monitored on average. Which ports are monitored is selected automatically according to what is connected to them, unless you override the defaults to monitor ports that are not connected to monitored computers or devices, or to disable monitoring of ports that are stitched to other ports and would otherwise be monitored automatically.
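
To put rough numbers on this, here is a minimal Python sketch. The 500 devices / 6,250 ports figure is the one quoted from the Microsoft docs in the question; the helper name and the assumption that monitored ports scale linearly with the device count are only for illustration, since the real count depends on what is connected to each port.

```python
# Guideline quoted in the question: roughly 500 devices / 6,250 monitored
# ports for a resource pool with two or more gateway servers. This is a
# tested figure, not a hard limit.
GUIDELINE_DEVICES = 500
GUIDELINE_PORTS = 6250


def estimate_monitored_ports(device_count, ports_per_device):
    """Hypothetical helper: naive linear estimate of monitored ports."""
    return device_count * ports_per_device


if __name__ == "__main__":
    # Ports per device implied by the guideline itself (6250 / 500).
    implied = GUIDELINE_PORTS / GUIDELINE_DEVICES
    for per_device in (10, implied):
        ports = estimate_monitored_ports(580, per_device)
        status = "above" if ports > GUIDELINE_PORTS else "within"
        print(f"580 devices x {per_device:g} ports = {ports:g} ({status} guideline)")
```

At 10 ports per device, 580 devices come to 5,800 monitored ports, which is under the guideline; at the 12.5 ports per device implied by the docs figure, they come to 7,250, which is over it.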

Meanwhile, please also check the firewalls: every firewall between the management server and the network devices needs to allow SNMP (UDP) and ICMP bidirectionally.
https://learn.microsoft.com/en-us/system-center/scom/plan-security-config-firewall?view=sc-om-2019
https://learn.microsoft.com/en-us/system-center/scom/manage-monitor-networkdevice-overview?view=sc-om-2019

As for creating a second RP: since the limit applies to the number of monitored ports, if the monitored ports exceed the limit, we suggest adding more Gateway servers.

Hope it can help.

Answer

Hey @SChalakov ,
Thanks for getting back to me on this. I couldn't answer your post because of the 1000 character restriction.

"First, filter the Event Logs on my two Gateways (RP) and look for any related warnings or errors."

I did. There were a couple of errors, which I dismissed at first, thinking they belonged to a previous problem. However, once I restarted the health services on the GWs, they came back:

Type mismatch for RunAs profile in workflow "System.NetworkManagement.Cisco.Memory.InsufficientFreeMemory", running for instance "MEM-20" with id:"{some-id}". Workflow will not be loaded. Management group "MGname"

There were a lot of those, probably one for every workflow used to monitor each network device. That seems to be the problem. I suppose my Run As profiles are still misconfigured. Additionally, I checked the distribution of the SNMP account again. It is distributed to every Mgmt and GW server as well as to every RP that contains network devices.
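
To see how many workflows are affected, events like this can be tallied by workflow name. Below is a minimal sketch in Python, assuming the event descriptions have been exported to text with one message per line. The regex is derived only from the single message quoted above, and `count_failed_workflows` is a hypothetical helper name, not part of SCOM.

```python
import re
from collections import Counter

# Pattern derived from the one message quoted in this post; other events
# may be worded differently and would simply not match.
PATTERN = re.compile(
    r'Type mismatch for RunAs profile in workflow "(?P<workflow>[^"]+)", '
    r'running for instance "(?P<instance>[^"]+)"'
)


def count_failed_workflows(lines):
    """Count how often each workflow name appears in type-mismatch events."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("workflow")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'Type mismatch for RunAs profile in workflow '
        '"System.NetworkManagement.Cisco.Memory.InsufficientFreeMemory", '
        'running for instance "MEM-20" with id:"{some-id}". '
        'Workflow will not be loaded. Management group "MGname"',
    ]
    for workflow, count in count_failed_workflows(sample).most_common():
        print(count, workflow)
```

If every failing workflow turns out to belong to the SNMPv3 devices, that points at the Run As profile mapping rather than at capacity.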

The run as profiles are configured thus:
https://ibb.co/k3341D1 (picture insertion seems to be buggy)
(SNMP Monitoring Account profile)
and
https://ibb.co/CtyTvY9
(SNMPv3 Monitoring Account profile)

"Second, empty the Health Service cache on those two servers."

That I did, but seeing as the error persisted, I suppose the above-mentioned configuration is the source.

"Open the Health Explorer of a managed network device and check in the monitor hierarchy if there are monitors, which are initialized."

Every monitor is shown as "Not monitored" (green circle) and there are no events or states.

Additionally, I deleted some 110 devices to get under the 500-device mark. It made no difference, which makes perfect sense after your answer.
Something I did notice, however: the SNMP v2 devices are monitored correctly.

Another piece of information I might mention: the devices are already monitored by a different (older) MG. They allow my new SCOM environment to connect to them but aren't using the new SCOM servers as a trap receiver. As far as I know, that shouldn't be a problem. In any case, the errors are certainly not caused by that.

Regards
Simon

Answer

Hi Simon,

do you still see the memory-related events after deleting some of the devices? I am also not quite sure that those events are directly related to the issue, as otherwise all network devices should have been affected, not only those monitored over SNMPv3.
To be honest I have no clue what the cause could be here, particularly when those devices are monitored by your old MG.
Have you considered opening a Support Case with Microsoft?
Have you also checked the account mapping to the SNMPv3 related profiles?

Thanks and Regards,
Stoyan

Answer

As Stoyan suggested, I resolved this issue with help from a Microsoft support engineer.

Long story short, the Run As profile configuration was faulty. We didn't find out what exactly the problem was, but we found a workaround.

You can delete all assigned objects, classes, etc. from the SNMP Monitoring profiles. After that, simply run a network discovery to let SCOM autoconfigure the needed Run As profiles. Worked like a charm.
Also make sure you have enough CPU power, because otherwise the network discovery will sometimes get stuck (without any logging or feedback).

Thanks for the help everyone

Best regards & Happy Christmas
Simon
