Monarch-9366 asked KenWatson-3017 commented

Hyper-V Cluster Issues - Event ID 252 & 10400

I have a 4-node Hyper-V cluster running on Windows Server 2016 Datacenter. It is connected to a Tegile/Tintri T4700 storage array via Fibre Channel, using Cluster Shared Volumes (CSVs).

NIC resets have been occurring as far back as I can trace, and each one generates one of the following warnings in the System Event Log. The resets have occurred both during general operations and during Hyper-V Live Migrations of VMs. The last one occurred on 5/17/2020 at 8:57 AM, during a Live Migration of VMs to the node on which the NIC reset occurred.

The network interface "HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter #2" has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets.
Reason: The network driver did not respond to an OID request in a timely fashion.

The network interface "HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adapter" has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets.
Reason: The network driver detected that its hardware has stopped responding to commands.

Starting in August of 2019, a NIC reset would also cause issues on the cluster node where it occurred, as detailed below.

In my research I found this TechNet forum post discussing the NIC reset issue, specifically the section about 1/3 of the way down: https://social.technet.microsoft.com/Forums/en-US/7b95bc5b-02d1-4dbb-a341-0517ae30cd9e/vms-will-get-stuck-stopping-and-unable-to-migrate-servers-from-that-host?forum=winserverhyperv
“I had a ticket lodged with Microsoft support. While they didn't fix the issue, I ended up finding the root cause. One of the SFP+ Adapters was generating a 10400 NDIS event stating that the driver detected that the hardware wasn't responding to instructions, so Windows would then reset the adapter. The Adapter was part of a NIC team which was then used for a vswitch in Hyper-V. For some reason when the adapter gets reset, it generates an error with the vswitch which then seems to completely break the VMMS service.
Microsoft has offered no explanation as to why this happens. The point of NIC teaming is so that if one adapter drops, everything can keep working. We ended up updating drivers, and I logged a call with the OEM to get firmware and other updates done. All we can do now is cross our fingers that it doesn't error again.”

This is what is happening to me. The VMMS service is so broken that I have to shut down every VM on the affected node. I then try to restart the node, but it gets stuck shutting down and I have to force a power-off. When it resets, the VMs move to a different node and start back up. Not good.

I have also received Event ID 252 in the System Event Log: “Memory allocated for packets in a vRss queue (on CPU 28) on switch C0978781-75EF-47B4-B9BC-6463064735A0 (Friendly Name: Team_Trunked) due to low resource on the physical NIC has increased to 256MB. Packets will be dropped once queue size reaches 512MB.” This event has preceded some of the NIC resets.
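For anyone comparing these warnings across nodes, here is a minimal Python sketch that pulls the CPU number, switch name, and queue sizes out of an Event ID 252 message. The regular expression is built only from the message text quoted above, so it is an assumption that other Windows builds word the message the same way; the function name `parse_event_252` is made up for illustration.

```python
import re

# Pattern derived from the single Event ID 252 message quoted in this post.
# Assumption: the wording is stable; adjust the pattern if your message differs.
EVENT_252 = re.compile(
    r"vRss queue \(on CPU (?P<cpu>\d+)\) on switch (?P<switch_id>[0-9A-Fa-f-]+) "
    r"\(Friendly Name: (?P<name>[^)]+)\).*?increased to (?P<current_mb>\d+)MB\."
    r".*?reaches (?P<limit_mb>\d+)MB"
)

def parse_event_252(message):
    """Return the fields of an Event 252 message, or None if it does not match."""
    m = EVENT_252.search(message)
    if m is None:
        return None
    info = m.groupdict()
    for key in ("cpu", "current_mb", "limit_mb"):
        info[key] = int(info[key])
    # How close the queue is to the size at which packets are dropped.
    info["fraction_of_limit"] = info["current_mb"] / info["limit_mb"]
    return info

sample = (
    "Memory allocated for packets in a vRss queue (on CPU 28) on switch "
    "C0978781-75EF-47B4-B9BC-6463064735A0 (Friendly Name: Team_Trunked) "
    "due to low resource on the physical NIC has increased to 256MB. "
    "Packets will be dropped once queue size reaches 512MB."
)
print(parse_event_252(sample))
```

Collecting these fields per node makes it easier to see whether the same CPU or the same switch is involved each time.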

I continued to have 252 events and 10400 NIC resets, particularly during live migrations, after switching to a converged networking model. I decided to move the live migration traffic to a separate team of NICs in an attempt to keep live migrations from putting Hyper-V into an unusable state. NIC resets during live migration have stopped since I made that change in May. My HPE engineer also recommended setting the "Maximum Number of RSS Queues" to a higher number to help alleviate the 252 events.

From 5/18/20 - 8/23/20 I had zero issues and thought I finally was in the clear. Wrong!

On 8/24/20 at 2:55 PM on Node 4, one of the 10Gb NICs in the team for the Hyper-V-VmSwitch reset (10400 event). No issues occurred with Hyper-V or the cluster because it was only 1 NIC of the team. One thing to note is that this was the 1st day of classes for the fall semester on our campus.

As you can see from the list of events below, I continued to have some 252 and 10400 events, but they did not break the Hyper-V Virtual Machine Management service until 9/23/20. On that day, 2 nodes of the cluster, Virtualsrv3 and Virtualsrv4, experienced NIC resets on both 10Gb NICs of the NIC team used by the Hyper-V-VmSwitch.

I have had support cases open with Microsoft and HPE, but no one has been able to find out why this continues to happen. Microsoft said to increase the "Receive Buffers" NIC setting from 512 to 2048, but that did not help either.

I also had a TechNet forum post going for some time on this as well:
https://social.technet.microsoft.com/Forums/en-US/ad05bf98-2a2f-423f-83a6-284b5fd1265e/cluster-node-event-252-cluster-service-crashed?forum=winserverhyperv

If anyone has had this issue and found an answer to it please let me know. Thank you.

8/28/2020
Virtualsrv4 – 252 (256MB) – 4:52:04 PM CDT
Virtualsrv1 – 252 (256MB) – 4:52:04 PM
Virtualsrv3 – 252 (256MB) – 4:52:05 PM
Virtualsrv1 – 252 (512MB) – 4:52:07 PM
Virtualsrv4 – 252 (512MB) – 4:52:08 PM
Virtualsrv4 – 10400 - 4:52:13 PM (Team_Trunked)
Virtualsrv3 – 10400 - 4:52:14 PM (Team_Trunked)
Virtualsrv1 – 10400 - 4:52:17 PM (Team-LM)
Only 1 NIC reset per team, so no cluster issues came of it
There were no Live Migrations going on at the time of these events
Veeam backups were occurring at this time

8/31/2020
Virtualsrv1 – 252 (256MB) – 2:51:10 PM
Virtualsrv2 – 252 (256MB) – 2:51:10 PM
No Veeam backups were occurring at this time

9/9/2020
Virtualsrv1 – 252 (256MB) – 7:13:54 PM
Virtualsrv2 – 252 (256MB) – 7:13:54 PM
Virtualsrv3 – 252 (256MB) – 7:13:55 PM
Virtualsrv4 – 252 (256MB) – 7:13:54 PM
No Veeam backups were occurring at this time

9/16/2020
Virtualsrv1 – 252 (256MB) – 4:55:05 PM
Virtualsrv2 – 252 (256MB) – 4:55:05 PM
Virtualsrv3 – 252 (256MB) – 4:55:05 PM
Veeam backups started at 4:30pm

9/23/2020
Virtualsrv1 – 252 (256MB) – 7:03:31 PM (CPU 12)
Virtualsrv1 – 252 (512MB) – 7:03:34 PM (CPU 12)
Virtualsrv1 – 252 (256MB) – 7:04:23 PM (CPU 26)
Virtualsrv1 – 252 (512MB) – 7:04:26 PM (CPU 26)
Virtualsrv1 – 10400 - 7:04:33 PM (Team-LM)
Virtualsrv2 – 252 (256MB) – 7:03:31 PM (CPU 86)
Virtualsrv2 – 252 (512MB) – 7:03:33 PM (CPU 86)
Virtualsrv2 – 252 (256MB) – 7:04:23 PM (CPU 80)
Virtualsrv2 – 252 (512MB) – 7:04:26 PM (CPU 80)
Virtualsrv3 – 252 (256MB) – 7:03:39 PM (CPU 2)
Virtualsrv3 – 10400 - 7:03:40 PM (Team-Trunked) different nics
Virtualsrv3 – 10400 - 7:03:41 PM (Team-Trunked) different nics
Virtualsrv3 – 252 (512MB) – 7:03:52 PM (CPU 2)
Virtualsrv3 – 252 (256MB) – 7:04:24 PM (CPU 54)
Virtualsrv3 – 252 (512MB) – 7:04:27 PM (CPU 54)
Virtualsrv3 – 252 (256MB) – 7:04:33 PM (CPU 12)
Virtualsrv3 – 252 (512MB) – 7:04:36 PM (CPU 12)
Virtualsrv3 – 10400 - 7:04:41 PM (Team-Trunked) same nic
Virtualsrv3 – 10400 - 7:05:05 PM (Team-Trunked) same nic
Virtualsrv4 – 252 (256MB) – 7:03:31 PM (CPU 126)
Virtualsrv4 – 252 (512MB) – 7:03:34 PM (CPU 126)
Virtualsrv4 – 10400 - 7:03:37 PM (Team-Trunked) different nics
Virtualsrv4 – 10400 - 7:03:39 PM (Team-Trunked) different nics
Virtualsrv4 – 252 (256MB) – 7:04:24 PM (CPU 126)
Virtualsrv4 – 252 (512MB) – 7:04:27 PM (CPU 126)
Veeam backups had just started at 7:00pm
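
To make the timing pattern in the lists above easier to see, here is a rough Python sketch that parses lines in the same format and reports, for each 10400 reset, how many seconds passed since the most recent 252 event on the same node. The function names (`parse_lines`, `reset_lags`) are made up for illustration, and the sample data is just the 8/28/2020 block copied from above. It ignores dates, so it assumes all lines belong to the same day.

```python
import re
from datetime import datetime

# Matches lines such as "Virtualsrv4 – 252 (256MB) – 4:52:04 PM CDT" and
# "Virtualsrv4 – 10400 - 4:52:13 PM (Team_Trunked)". Both the en dash and
# the plain hyphen are accepted because the lists above mix them.
LINE = re.compile(
    r"(?P<node>Virtualsrv\d+)\s*[–-]\s*(?P<event>252|10400)"
    r"(?:\s*\((?P<size>\d+)MB\))?\s*[–-]\s*(?P<time>\d{1,2}:\d{2}:\d{2} [AP]M)"
)

def parse_lines(text):
    """Turn the pasted list into dicts; lines that do not match are skipped."""
    events = []
    for raw in text.splitlines():
        m = LINE.search(raw)
        if m:
            events.append({
                "node": m["node"],
                "event": int(m["event"]),
                "size_mb": int(m["size"]) if m["size"] else None,
                "time": datetime.strptime(m["time"], "%I:%M:%S %p"),
            })
    return events

def reset_lags(events):
    """For each 10400 reset, seconds since the latest earlier 252 on that node."""
    lags = []
    for reset in (e for e in events if e["event"] == 10400):
        earlier = [e["time"] for e in events
                   if e["event"] == 252 and e["node"] == reset["node"]
                   and e["time"] <= reset["time"]]
        if earlier:
            lag = (reset["time"] - max(earlier)).total_seconds()
            lags.append((reset["node"], reset["time"].strftime("%I:%M:%S %p"), lag))
    return lags

sample_log = """\
Virtualsrv4 – 252 (256MB) – 4:52:04 PM CDT
Virtualsrv1 – 252 (256MB) – 4:52:04 PM
Virtualsrv3 – 252 (256MB) – 4:52:05 PM
Virtualsrv1 – 252 (512MB) – 4:52:07 PM
Virtualsrv4 – 252 (512MB) – 4:52:08 PM
Virtualsrv4 – 10400 - 4:52:13 PM (Team_Trunked)
Virtualsrv3 – 10400 - 4:52:14 PM (Team_Trunked)
Virtualsrv1 – 10400 - 4:52:17 PM (Team-LM)
"""

for node, when, lag in reset_lags(parse_lines(sample_log)):
    print(f"{node}: reset at {when}, {lag:.0f}s after the last 252 event")
```

For the 8/28/2020 entries this gives lags of 5, 9, and 10 seconds, which matches what can be read off the list by eye: each reset followed a 252 warning on the same node within seconds.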

windows-server-hyper-v windows-server-clustering
Hi,

Because this issue requires log analysis, I would suggest opening a case with Microsoft.

https://support.microsoft.com/en-us/gp/contactus81?Audience=Commercial&wa=wsignin1.0

Thanks for your understanding. If you have any questions, please feel free to let me know.

Best Regards,

Daniel

I had a case open with Microsoft for this last year and it did not resolve anything, so I am hesitant to open another one at a cost of $330/hour. Many hours would be spent without much happening, as this is a very complicated issue and no fix has ever been posted. That is why I currently have a case open with our hardware vendor, in hopes that something will come of it. I posted here hoping that someone who has encountered this would respond with some helpful insight. I have a lot to share with anyone who can help in this situation.

Monarch, have you been able to get any information as to why this is happening? I am experiencing the exact same issues using Intel servers and Intel X710 network adapters. I have not been able to figure out what is causing this.

Ken

0 Answers