Live Migration failure on Hyper-V cluster

Cousin_Hub 15 Reputation points
2023-03-21T14:35:18.24+00:00

I have a tricky one here, and I hope someone will have an idea about what is going on.

I was hired a few months ago by a company that has a 4-server Hyper-V cluster.

The servers were bought separately: let's say servers 1 and 2 first, then servers 3 and 4 two years later. All are the same model from HPE; only the processor generation differs, as two servers are older. The hosts run Windows Server 2019, the VMs are a mix of 2016 and 2019, and all of them are up to date with Microsoft's monthly security patches.

Before I arrived, Live Migration was working perfectly, but between the time the previous admin left and my arrival, it started to fail frequently for some VMs, specifically when draining a node to patch it.

The role raises a 21502 error (Migration failure for virtual computer XXXX to destination host YYYY) when I try to live migrate, with no additional information.

Investigating deeper, I noticed that if a VM was started on one of the older hosts, it can be live migrated to every host.

If a VM was started on one of the newer hosts, it can be live migrated only to the newer hosts.
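
To make the pattern explicit, here is a minimal Python sketch of the behaviour described above. The host names and the "older"/"newer" grouping are hypothetical placeholders, and the rule only encodes what was observed; it is not how Hyper-V actually evaluates compatibility.

```python
# Hypothetical host names; "older"/"newer" refers to the processor generation.
HOST_GENERATION = {
    "host1": "older",
    "host2": "older",
    "host3": "newer",
    "host4": "newer",
}


def can_live_migrate(started_on: str, destination: str) -> bool:
    """Encode the observed behaviour only (not Hyper-V's real compatibility check).

    A VM started on an older host migrates anywhere; a VM started on a newer
    host migrates only to another newer host.
    """
    if HOST_GENERATION[started_on] == "older":
        return True
    return HOST_GENERATION[destination] == "newer"


def valid_destinations(started_on: str, current_host: str) -> list[str]:
    """List the hosts a VM could be drained to, excluding the host it runs on."""
    return [
        host
        for host in HOST_GENERATION
        if host != current_host and can_live_migrate(started_on, host)
    ]


if __name__ == "__main__":
    # Destinations left for a VM started on host3 when host3 is drained.
    print(valid_destinations(started_on="host3", current_host="host3"))
```

Seen this way, draining a newer node leaves VMs started there with only the other newer host as a destination, which matches the failures seen during patching.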

Running Compare-VM against one of the older hosts reveals an incompatibility with ID 21026, but doesn't give any more clues about what is causing it.

Processor compatibility is enabled for all VMs, and I checked that all networks are declared the same way on each host. I even compared the processor options in each host's UEFI settings, but I still have no clue what is causing this issue.

I followed this guide from Microsoft, but had no luck either: https://learn.microsoft.com/en-us/troubleshoot/windows-server/virtualization/troubleshoot-live-migration-issues

Does anyone have ideas on where to investigate next?

Windows Server 2019
Hyper-V
Windows Server Clustering

3 answers

  1. Cousin_Hub 15 Reputation points
    2024-01-23T16:39:35.3433333+00:00

    I escalated to HPE and got an answer! They first provided a link to a Microsoft KB, but Microsoft has removed it in order to update it and republish it later (I don't know when).

    Here is the information I got from the HPE L3 engineer (the answer was in French, I hope the translation is clear):

    "posted an update on case SIE268522, update #57, on January 9, 2024, to indicate that Microsoft replaced the two KB articles with old learning articles without informing HPE. We then found out from Microsoft that they removed these articles because they wanted to update their content. The root cause of the issue is that in WS2019 with Hyper-V enabled, on EPYC Gen1/2 processors, the root/host OS would indicate ssbd support in the global speculation data structure. However, on EPYC Gen3/4 processors, the root/host OS would not indicate ssbd support in the global speculation data structure. The fix is in WS2022. Once Microsoft KB articles are available again, we will notify the customer."

    So according to HPE, who are themselves relaying Microsoft, migrating to Server 2022 should solve this issue. I haven't migrated yet (my hypervisors are still on 2019), so I can't confirm this solution. But I'm happy, as my investigations were heading in that direction: I was more than convinced that the exposure of processor capabilities was part of the problem :)

    I hope this helps!

    2 people found this answer helpful.

  2. Amit Singh 4,846 Reputation points
    2023-03-22T10:14:04+00:00

    Check if the anti-virus on the host computer is causing the problem. Exclude it from doing 'on-access' scanning of the CSV.


  3. Kristian Halvorsen 0 Reputation points
    2023-11-29T10:06:03.24+00:00

    Hello,

    Did you ever find a solution? I have the EXACT same problem.