KennethKirchner-4860 asked

Windows Server 2016 VM stops responding to RDP, Console hangs at profile service, eventually loses IP/NIC

Hello all,

We have an issue on a few of our Windows 2016 servers that started sometime in the last few months and seems to be getting progressively worse. Resetting the system seems to right it for a few days, but inevitably it will slide into the same useless state and require another hard reset. So far the system has bounced back, with sometimes nothing more than a chkdsk, but on our SQL servers this can sometimes take a few minutes of recovery.

These systems run well for a few days, but then we notice that we can no longer connect via RDP. If we try to log in on the VM console, it will usually hang on the "Waiting for user profile service" message, which never resolves, and the console is stuck on that login until reset. The SQL or web service on the VM continues to run as if there is no problem for several hours, but eventually we notice that the IP address vCenter shows for the server disappears and the box is now completely isolated. We have to hard reset to restore service.

I have run SFC on all of these servers and no corruption is reported. I ran the DISM tools and they do report that the component store can be repaired, but looking in the DISM and CBS logs, there are no errors reported, only Info and Warning. We don't seem to have any problem installing Windows updates, and we are patched up to the March roll-up. These servers can't reach the MS Update servers, so I'm not sure how to clear these DISM issues. I have injected from a KB CAB before, but if the logs don't identify a KB, then what?

This behavior, where it works OK for a few days and then services start to die off, sounds to me like a memory leak in some component, but I'm sure there could be other things. We recently installed Elastic Metricbeat to see if we can spot the process that might be running amok.
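A rough way to sift per-process memory samples for a leak candidate is to flag processes whose usage rises in most sampling intervals and ends well above where it started. The sketch below assumes the samples have already been exported from the monitoring tool into plain lists; the function name, thresholds, and sample values are all made up for illustration and are not part of Metricbeat.

```python
from typing import Dict, List


def steady_growers(samples: Dict[str, List[float]],
                   min_ratio: float = 1.5,
                   min_rising: float = 0.8) -> List[str]:
    """Return process names whose memory keeps climbing across samples.

    samples maps a process name to memory readings in time order.
    The thresholds are arbitrary assumptions, not recommended values.
    """
    suspects = []
    for name, series in samples.items():
        if len(series) < 3 or series[0] <= 0:
            continue  # too little data to judge a trend
        steps = list(zip(series, series[1:]))
        # Fraction of intervals in which memory went up.
        rising = sum(1 for prev, cur in steps if cur > prev) / len(steps)
        if rising >= min_rising and series[-1] / series[0] >= min_ratio:
            suspects.append(name)
    return sorted(suspects)


# Hypothetical working-set samples in MB, one reading per hour.
example = {
    "sqlservr.exe": [4000, 4010, 4005, 4012, 4008],
    "suspect.exe": [120, 180, 260, 390, 610],
}
print(steady_growers(example))
```

A process with large but flat usage (the SQL Server row here) is not flagged; only the one that grows in nearly every interval is.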

So I am looking for some tips on things to watch that might cause RDP/User profile service to die, or a NIC to suddenly stop working. I assume that the VMware tools installed on this server are getting killed or choked out by this supposed runaway process.

Or if anyone is a DISM/CBS guru and wants to tell me how to fix my component store, that would also be appreciated.

remote-desktop-services windows-server-2016

KennethKirchner-4860 answered

We got nowhere with this. It just stopped happening. So $500 wasted on MS Technical Services. I am going to assume this was some kind of conflict between our antivirus suite and Microsoft Trusted Installer. That was a common thing we saw in the log files when the crash occurred. I guess a Windows update or a McAfee update resolved the issue at some unknown time. I just hope it doesn't come back.

CarlFan-MSFT answered KennethKirchner-4860 commented

Hi,
Some ideas that may be helpful to you:
1. In a similar case, it turned out that somewhere along the line a firewall was dropping ICMP requests. Because the client didn't get a response, slow network connection detection had to time out before the user profile service continued loading the profile. Check whether a firewall is dropping network traffic between the RDP components.
Also, if you have multiple network cards, make sure your "production" card is listed first in Network Connections.
2. Try a clean boot:
- Start > Run > msconfig > "Services" tab
- Check the "Hide all Microsoft services" box and click "Disable all" (if it is not grayed out)
- Click the "Startup" tab, click "Disable all" and click "OK".
- Please ensure that NLA is not disabled. Then restart the computer.
NOTE: we can go back to a normal boot by running msconfig again and selecting Normal startup on the General tab.
In the clean boot environment, third-party services and applications are disabled; please check whether the issue persists.
3. Check the user profile size, and test whether a new user profile can connect via RDP.
4. If we want to repair the system component store, we could try using a Windows Server 2016 image as the repair source.
https://bornsql.ca/blog/repair-windows-server-2016-installation/
Hope this helps, and please accept this as the answer if the response is useful.
Best Regards,
Carl


First, thanks for responding.
1. I don't think this is the case, since the profile service hang is happening on the VM console, not in the RDP sessions. There is no firewall between the RDP client and this server.
2. Unfortunately we can't leave the server in a state where it is not providing its intended service. It typically takes 4 days for the server to croak again after a reset.
3. Not sure what you are saying here.
4. I have taken the baseline image and converted it to a WIM file to repair some of our issues. It seems to be cleaning up all the payload corruption issues and leaving only the manifest issues behind. I have repaired these in the past using the CAB from the KB that it identifies, so I think I will be good on repairing all the DISM issues as long as I can find the KB.
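For reference, a repair from a local WIM as described in point 4 usually comes down to a single DISM call that names the WIM as the source and blocks the fallback to Windows Update. The sketch below only builds that command line so it can be reviewed before running; the WIM path and the helper name are placeholders, and the image index is an assumption that has to match the installed edition (`DISM /Get-WimInfo /WimFile:<path>` lists the indexes).

```python
from typing import List


def restorehealth_command(wim_path: str, index: int = 1) -> List[str]:
    """Build a DISM RestoreHealth command that repairs from a local WIM.

    wim_path and index are supplied by the caller; nothing here checks
    that the WIM exists or that the index matches the running edition.
    """
    source = f"/Source:WIM:{wim_path}:{index}"
    return [
        "DISM.exe", "/Online", "/Cleanup-Image", "/RestoreHealth",
        source,
        "/LimitAccess",  # do not try to contact Windows Update
    ]


# Placeholder path; substitute the WIM made from the baseline image.
print(" ".join(restorehealth_command(r"D:\repair\install.wim")))
```

The list form can be handed to `subprocess.run` from an elevated prompt once the path and index have been checked.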

I think the next course of action is to disable the security suite, or put it into monitoring, to see if that is causing our issue.



VitorMarques-1914 answered VitorMarques-1914 commented

I have the same problem. I think this started in late February or the beginning of March.
At first I would reset the VMs and it would last a few weeks; lately it's a few days, with luck.
They all stop responding, and if I try to log on with the console it hangs on the profile.
The only "error" I can see in the logs is this:
svchost (1068) SoftwareUsageMetrics-Svc: A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log"
I don't fully know whether this is the actual problem or a byproduct of the hang....
I moved the VM to another server and the problem is the same.
I have Malwarebytes Anti-Ransomware on the servers; I think I'm going to disable it to see if that solves it.


Do you think this started after applying the February 2021 cumulative update? KB4601318?


I don't think so, because there were servers that I did not update and the problem was the same.

KennethKirchner-4860 answered

It seems the same for us. We found out the IP is not actually disappearing; it's just the vmtools service being taken down that makes the IP disappear in vCenter. The VM still pings, it's just that all the services have stopped.

VitorMarques-1914 answered

I upgraded one of the VMs with RDP to Server 2019 and it's the same thing...
I removed all of the antivirus/ransomware suite to test.
I am out of ideas :(

KennethKirchner-4860 answered

We had 56 of our servers reset last Sunday. An update to our McAfee suite was pushed and this led to a crash. It seems directly related to msiexec.exe being run in our case. Luckily all the servers bounced back pretty quickly. You might check your event logs to see if the system is trying to install something just prior to the hangs/crashes. I saw in one log, right after the reset, that it was talking about the February rollup missing or having some issue. The February rollup seems to have a history of issues and Microsoft had to rush out an SSU to fix them. Not sure what the impact was if you were unlucky enough to install the Feb 2021 rollup before the SSU was installed. Maybe that's what we are seeing now. We bit the bullet and bought a Microsoft Support Incident ($500) and they are crunching our memory dumps. For now we are going to try disabling the Windows Update/Windows Installer/Windows Modules Installer services and see if we at least get some stability back on our critical servers, even if we have to manually update them for now.
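The event log check suggested above can be done by hand, but with many servers it may be easier to export the installer events and line them up against the known hang times. The sketch below is one way to do that comparison under some assumptions: the events have already been exported to (timestamp, description) pairs (for example from the MsiInstaller source in the Application log), the 30-minute window is an arbitrary guess, and all names and sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str]


def installs_before_hangs(installs: List[Event],
                          hangs: List[datetime],
                          window: timedelta = timedelta(minutes=30)
                          ) -> List[Tuple[datetime, datetime, str]]:
    """Pair each hang with installer events logged shortly before it."""
    hits = []
    for hang in hangs:
        for ts, desc in installs:
            # Keep events at or before the hang, no older than the window.
            if timedelta(0) <= hang - ts <= window:
                hits.append((hang, ts, desc))
    return hits


# Hypothetical data standing in for an exported event log.
installs = [
    (datetime(2021, 4, 11, 2, 5), "antivirus agent update started"),
    (datetime(2021, 4, 12, 14, 0), "unrelated package install"),
]
hangs = [datetime(2021, 4, 11, 2, 20)]

for hang, ts, desc in installs_before_hangs(installs, hangs):
    print(f"hang at {hang}: '{desc}' logged at {ts}")
```

A match only shows that an install ran shortly before a hang; it does not prove the install caused it.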

VitorMarques-1914 answered VitorMarques-1914 edited

Hi, I have been testing one VM without "any" antivirus suite and for now it seems OK.
Could it be some obscure interaction with Windows anti-malware/Defender? Even with the Malwarebytes suite installed, Defender was working in the background.
Or could it be some I/O limitation of the virtual hard disk?
