Hello all,
We have an issue on a few of our Windows 2016 servers that started sometime in the last few months and seems to be getting progressively worse. Resetting the system seems to right it for a few days, but inevitably it will slide into the same useless state and require another hard reset. So far the system has bounced back, with sometimes nothing more than a chkdsk, but on our SQL servers this can sometimes take a few minutes of recovery.
These systems run well for a few days, but then we notice that we can no longer connect via RDP. If we try to log in on the VM console, it will usually hang on the "Waiting for user profile service" but that never resolves and the console is stuck on that login until reset. The SQL or web service on the VM continue to run as if there is no problem for several hours, but eventually we will notice the IP address that vCenter shows for the server disappears and the box is now completely isolated. We have to hard reset to restore service.
I have ran SFC on all of these servers and there is no corruption reported. I ran the DISM tools and it does report the component store can be repaired, but looking in the DISM and CBS logs, there are no errors reported, only Info and Warning. We dont seem to have any problem installing Windows updates, we are patched up to the March roll-up. These servers cant reach MS Update servers, so not sure how to clear these DISM issues. I have injected from a KB CAB before, but if the logs dont identify a KB, then what?
This behavior where it works ok for a few days, then services start to die off sounds to me like a memory leak in some component, but Im sure there could be other things. We recently installed Elastic Metricbeat to see if we can spot the process that might be running amok.
So I am looking for some tips on things to watch that might cause RDP/User profile service to die, or a NIC to suddenly stop working. I assume that the VMware tools installed on this server are getting killed or choked out by this supposed runaway process.
Or if anyone is a DISM/CBS guru and wants to tell me how to fix my component store, that would also be appreciated.