vafran asked vafran answered

SCOM 1807 - Unable to add new agents/clients

Hello,

Suddenly I am unable to add new agents to the SCOM 2016 environment. They are installed correctly, but appear as not monitored in the console.

32242-image.png

On the agents I get the following events:

20070
20071
21016 (less frequently)



On the SCOM server I get events 20000.

The agent control panel shows FQDN of management server and port 5723, which is opened. Also there are no certificates in the environment.
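For anyone wanting to confirm the port claim from the agent side, here is a minimal Python sketch of a TCP reachability check against port 5723. The function name `can_reach` and the host name in the usage note are illustrative assumptions, not anything from this environment. A successful connect only shows that the port is reachable; it does not show that the agent is authorized or receiving configuration.

```python
import socket


def can_reach(host: str, port: int = 5723, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Usage would be something like `can_reach("ms01.contoso.local")`, where the host name is a placeholder for the management server FQDN shown in the agent control panel.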

I searched for the issue and ended up removing all objects in maintenance state.

This seems to have started happening since I added a new SQL cluster, but it may be a coincidence. I added the first two nodes just fine, but a few weeks later I added a third one (a reinstall of a node from another cluster, preserving the server name), and it was the first client/agent I found to fail. Also, the instances of this cluster are detected from the proxy node, but they also appear as not monitored.

The agents are added from the SCOM console, but if I install manually and then approve it from the console, the situation is exactly the same.

This is the sequence of events on a freshly installed agent on a new server:


32147-image.png


Any advice?


msc-operations-manager

LeonLaude answered LeonLaude edited

Hi @AaronVazquez-7771,

Does this happen to all new agents you're trying to install either manually or by pushing from the Operations Console?

Which Update Rollup are you running in your SCOM 2016 environment?

If an agent computer has been upgraded but retained the name, did you ensure to uninstall the agent, make sure it was gone in the Operations Console, and then try to re-install it on the new computer?

A few things to check:

  • Start by clearing the cache of the SCOM management servers (How and When to Clear the Cache)

  • Check the Health of your SCOM management group (Monitoring > Operations Manager > Management Group Health)

  • Check that the agent computer has the correct management server/group information in the registry (HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\<Management Group Name>\Parent Health Services\0\)

  • Have a look at Kevin's blog here: https://kevinholman.com/2014/10/27/agents-that-never-connect-to-management-server/
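For the registry check in the list above, a minimal Python sketch that builds the key path and dumps whatever values are stored under it. The function names and the management group name are hypothetical; `winreg` is the Windows-only standard library module, so `read_values` only runs on the agent computer itself. The sketch lists all values it finds rather than assuming specific value names.

```python
def parent_health_service_key(management_group: str, index: int = 0) -> str:
    """Build the HKLM subkey path for an agent's parent health service."""
    return "\\".join([
        "SOFTWARE", "Microsoft", "Microsoft Operations Manager", "3.0",
        "Agent Management Groups", management_group,
        "Parent Health Services", str(index),
    ])


def read_values(subkey: str) -> dict:
    """Return every value under HKLM\\<subkey> as a name -> data mapping."""
    import winreg  # imported here because the module exists only on Windows

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
        value_count = winreg.QueryInfoKey(key)[1]  # (subkeys, values, mtime)
        for i in range(value_count):
            name, data, _value_type = winreg.EnumValue(key, i)
            values[name] = data
    return values
```

On the agent, `read_values(parent_health_service_key("MyManagementGroup"))` would show what the agent believes its management server and port to be, which can then be compared with the values expected for the management group.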


(If the reply was helpful please don't forget to upvote or accept as answer, thank you)


Best regards,
Leon


vafran answered StoyanChalakov commented

Hi @LeonLaude.

Does this happen to all new agents you're trying to install either manually or by pushing from the Operations Console?
All of them, either way.

Which Update Rollup are you running in your SCOM 2016 environment?
Sorry, it is SCOM1807, not 2016.

I cleared the cache as per the article.
Then I reviewed the SCOM management group, and after a few minutes all statuses were greyed out.

I received this alert: "The All Management Servers Pool has not reported availability since Wed, 14 Oct 2020 07:19:09 GMT. This adversely affects all availability calculation for the entire management group." But I hope this will fix itself after a while?


32226-image.png
32110-image.png

The first thing I did was increase the OM database max size.

I already had checked the last two points in my previous troubleshooting.

I can see the snapshot synchronization error in the ManagementServer event log. Not sure how relevant this may be.

32227-image.png




Have you made sure no management server is in maintenance mode?
The issue you are having is that the SCOM management group is grayed out. This could be due to database connectivity issues, or to the database disks having filled up.

Since you mentioned that you added more disk to the SCOM databases (make sure both Operations Database & Operations Data Warehouse have at least 40-50% free disk space), try clearing the cache on the SCOM management servers once again.

Also make sure to continue monitoring the Operations Manager event log for ANY warnings and errors.

If the cache clearing does not help, try rebooting the management servers, one by one.




1 Vote 1 ·

Hey,
please start by checking the events on your management server. It seems that you have an issue with the management group itself. Post the related events here and we will try to help you out.

Regards,
Stoyan

0 Votes 0 ·
vafran answered vafran commented

Hey there. The management group was not greyed out until after deleting the cache.

I do not see any other error events on the management server itself, but this information event is creeping me out:

Event 21023
OpsMgr has no configuration for management group XXXXXX and is requesting new configuration from the Configuration Service.

This is only happening since the cache was deleted on the management server, around 90 minutes ago.

32291-image.png




Clearing the cache shouldn't put the SCOM management servers into a grayed-out state like this, so something is definitely not right. And since you cannot get your agents installed, there clearly is an issue.

Did you also clear the Operations Console's cache? Just wanted to make sure you did so you are viewing "fresh" information.

What's the status of the monitored agents, also grayed out?

For the event 21023, please refer to the steps mentioned here:
https://docs.microsoft.com/en-us/archive/blogs/thomase/event-id-21023-management-server-not-downloading-configuration-files

1 Vote 1 ·

Hi.

Yes, I started the console from the management server itself with the /clearcache parameter.

All the "pre-existing" monitored agents show as healthy. However, they are not actually reporting: I had some alerts about stopped services on a server just before this happened, and although the services have since been started, they still appear as stopped in SCOM.

Now I see the same events on all monitored servers, this is from one that was working until earlier this morning:

32255-image.png


0 Votes 0 ·
StoyanChalakov answered vafran commented

Hi,

I absolutely agree with Leon: clearing the cache of a management server is a standard troubleshooting procedure and should not affect the functionality of a management server in any way. Can you please confirm two more things:

  • Can you please make sure that your "Health State" folder (the cache) is not being scanned by AV software? This is important because AV programs lock files, which can cause cache corruption and subsequent issues. You need to ensure that the proper exclusions are made:

Configuring antivirus exclusions for agent and components

  • The second thing, mentioned by Leon, is to make sure you have no DB connectivity or performance issues. Those are usually indicated by a particular event, Warning 2115:

Troubleshoot event ID 2115-related performance problems in Operations Manager

Can you please verify this?

Thanks and Regards,
Stoyan



Hi @StoyanChalakov ,

I sent this to the security team, but the AV is not locking any files. I restored the management server VM from a snapshot taken a few hours earlier and it is working fine now, but new agents still cannot report.

Now, this does not look good. It seems I have been facing these problems since the 9th. Maybe a database restore would help?

32304-image.png


0 Votes 0 ·

When restoring a SCOM management server, you can follow up on Kevin's blog over here:
https://kevinholman.com/2018/10/29/recovering-a-scom-management-server/

What does the Health Explorer look like for the rest of the objects under the Management Group Infrastructure?

32343-management-group-infrastructure.png



1 Vote 1 ·

Hi Leon, it's all green except for the mgmt group.

32353-image.png


0 Votes 0 ·
StoyanChalakov answered StoyanChalakov commented

Hi Aaron,

I still think that it might be related to your database. Did you check for 2115 warnings on the management server?

Please check also the suggested actions when this fires:

Causes
This can happen when:
The database or database server is unavailable (networking issue, firewall, disk space, etc.)
The System Center Management Configuration Windows Service account no longer has the required access to the database
The “AgentPoolAssignment” work item has been disabled in the ConfigService.config. The ConfigService.config file is located in “%Program Files%\Microsoft System Center 2012 R2\Operations Manager\Server”.

Resolutions
To further investigate the issue, consider the following:
Review the Operations Manager event log for errors indicating problems with the System Center Management Configuration Service. Filter the event log by a source of “OpsMgr Management Configuration” to search for errors.
Confirm you are not seeing connection errors to the Operations Manager database from the management server in the Operations Manager event log.
Using the Operations Manager Console and SQL Server Management Studio, validate the Default Action Account has the correct access to the database where the Operations Manager database is installed. For more information about configuring the Default Action Account please see the Operations Manager Security Guide.
Open the ConfigService.config file and search for “AgentPoolAssignment” under WorkItems. Make sure Enabled property is set to true. The ConfigService.config file is located in “%Program Files%\Microsoft System Center 2012 R2\Operations Manager\Server”.
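The last check in the list above can be scripted. Below is a rough Python sketch that scans the XML for any element mentioning the work item name and reports its Enabled setting. It assumes the work item is an XML element that carries its name in one attribute and an `Enabled` attribute alongside it; that layout is an assumption and may not match the real ConfigService.config, so treat an empty result as "look at the file by hand" rather than as proof of a problem. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_work_item_enabled(config_xml: str,
                           work_item: str = "AgentPoolAssignment") -> list:
    """Return the Enabled flags of every element that names work_item.

    Assumed layout: the element has one attribute whose value is the work
    item name, plus an attribute called Enabled (matched case-insensitively).
    """
    root = ET.fromstring(config_xml)
    flags = []
    for elem in root.iter():
        if work_item not in elem.attrib.values():
            continue
        for attr_name, attr_value in elem.attrib.items():
            if attr_name.lower() == "enabled":
                flags.append(attr_value.strip().lower() == "true")
    return flags
```

To use it, read the file contents into a string and pass them in. A result of `[True]` would mean one matching entry was found and it is enabled; `[False]` would point at the disabled work item described under Causes.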

Can you please verify this?

Regards,
Stoyan


Hello @StoyanChalakov .

Thanks for the help. There is definitely something funny going on. I tried to flush the mgmt server cache again, and again it all stopped working, with the whole mgmt group infrastructure showing as greyed out.


I only received a few 2115 events yesterday while I was doing server reboots and SQL reboots, but none before that.

I think I will try restoring both the management server and the database to a previous point in time.

0 Votes 0 ·

Hi @AaronVazquez-7771,

definitely very strange behavior. Please do let us know how the restore goes, I really hope you have a working and "healthy" backup.
Please post here if you need help or have questions.

Thanks and Regards,
Stoyan

0 Votes 0 ·
vafran answered

Hello,

I did recover from a backup taken on a date when everything was definitely working. Although it seemed to work at first, it went back to the non-working situation.

The culprit is the delta synchronization:

The System Center Management Configuration Service has failed to perform the Configuration Store Delta synchronization state task in an acceptable amount of time.

The purpose of this monitor is to determine if the Configuration Service has failed to run the “DeltaSynchronization” work item over the last 15 minutes (default). The impact of the “DeltaSynchronization” work item failing is that during this time the management group could experience inconsistent behaviors in its ability to update Agents with new configuration.


So I have these 29181 events for the failed sync, with a System.InvalidCastException error.

Prior to the restore, the following query on the database would return 10 (failed):

select * from cs.WorkItem where workitemname like '%snapshot%' order by StartedDateTimeUtc desc


After the restore, I got just one successful sync (20) but no further attempts, while before the restore there was one failed attempt every few seconds.
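In case it helps others read the output of that query, here is a small Python sketch that tallies work item rows by state, using only the two codes mentioned above (10 = failed, 20 = succeeded). The column name `WorkItemStateId` is an assumption about the `cs.WorkItem` table and is passed as a parameter so it can be changed. The rows are plain dictionaries, however they were fetched.

```python
from collections import Counter

# Only the two state codes mentioned in this thread; anything else is "other".
STATE_LABELS = {10: "failed", 20: "succeeded"}


def summarize(rows, state_column="WorkItemStateId"):
    """Count work item rows per state label.

    rows: iterable of dict-like records from the cs.WorkItem query.
    state_column: assumed name of the column holding the state code.
    """
    counts = Counter()
    for row in rows:
        label = STATE_LABELS.get(row[state_column], "other")
        counts[label] += 1
    return dict(counts)
```

A run of nothing but "failed" entries every few seconds, as described before the restore, would show up here as a large `failed` count with no `succeeded` entries.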

![37057-image.png][1]

Check this:
https://docs.microsoft.com/en-us/troubleshoot/system-center/scom/configuration-not-updated-with-event-29181

But those settings in my build are already well above these numbers, and the environment is not that large.

Also, this is not a timeout, like most of the issues found in forums, but an InvalidCastException type of error.

