Hello all,
We are facing a very frustrating issue with our Hyper-V cluster. It happens randomly, and we cannot get the hang of it.
Out of nowhere, the cluster resources of one or more random VMs go into what I can only call a loop state. The pictures will say what my words cannot.
What you see in the first 4 pictures repeats in a loop, about 100 times per minute. If you click on the affected resource, the cluster management console crashes.
On the other hand, you can still manage the machine through the Hyper-V console, but the VM did reset a few times during this error.
When I go through the logs I cannot figure out what could be causing this, because the VM is actually working (until it gets reset).
Log entries from "Applications and Services Logs\Microsoft\Windows\Hyper-V-StorageVSP": the last 9 entries out of 1800, all logged within one second:
1.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = No Errors, New status = Disconnected.
2.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Recoverable Error Detected.
3.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = Disconnected, New status = No Errors.
4.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = No Errors.
5.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' received an IO failure with error = SRB_STATUS_ERROR_RECOVERY. Current device state = No Errors, New state = Recoverable Error Detected, Current status = No Errors.
6.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = No Errors, New status = Disconnected.
7.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Recoverable Error Detected.
8.)An I/O request for device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' took 1216203 milliseconds to complete. Operation code = READ16, Data transfer length = 512, Status = SRB_STATUS_ABORTED. ### At this point the VM quit unexpectedly
9.)Storage device '\?\UNC\NAMEsofs01\Cluster\VMs\XXX-PROD-WEB02\VHDs\XXX-prod-web02_sys.vhdx' received a recovery status notification. Current device state = Shutting Down, Last status = Disconnected, New status = No Errors.
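With 1800 entries in one second it is hard to see the pattern by eye. Below is a minimal sketch for tallying the transitions, assuming the event messages have been exported as plain text, one message per string. The function name `summarize_transitions` and the export format are my own assumptions, not anything produced by Hyper-V; the regular expressions only rely on the message wording quoted above.

```python
import re
from collections import Counter

# "changed recovery state" messages: Previous state = X, New state = Y.
STATE_RE = re.compile(r"Previous state = (?P<old>[^,]+), New state = (?P<new>[^.]+)\.")
# "recovery status notification" messages: Last status = X, New status = Y.
STATUS_RE = re.compile(r"Last status = (?P<old>[^,]+), New status = (?P<new>[^.]+)\.")


def summarize_transitions(messages):
    """Count (old, new) pairs for device state and connection status.

    `messages` is any iterable of event message strings (hypothetical
    export format: one message per string).
    """
    states = Counter()
    statuses = Counter()
    for msg in messages:
        m = STATE_RE.search(msg)
        if m:
            states[(m["old"].strip(), m["new"].strip())] += 1
        m = STATUS_RE.search(msg)
        if m:
            statuses[(m["old"].strip(), m["new"].strip())] += 1
    return states, statuses


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py exported_messages.txt
    with open(sys.argv[1], encoding="utf-8") as fh:
        state_counts, status_counts = summarize_transitions(fh)
    for pair, n in status_counts.most_common():
        print("status", pair, n)
    for pair, n in state_counts.most_common():
        print("state", pair, n)
```

If the status counts turn out to be dominated by "No Errors" to "Disconnected" and back, that would at least confirm that the connection to the share is flapping rather than failing once.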
These entries are timestamped 1:51:46, when the problem started. There are no later log entries of this kind, but the cluster resource was still looping as in the first 4 pictures, and you cannot kill that loop.
Log from Hyper-V worker:
1.)'name-prod-web02' was resumed from critical error. (Virtual machine ID 08BFD5A3-AF52-4F66-BD01-C635FED8F87A)
2.)'name-prod-web02' was paused for critical error. (Virtual machine ID 08BFD5A3-AF52-4F66-BD01-C635FED8F87A)
3.)'name-prod-web02' was resumed from critical error. (Virtual machine ID 08BFD5A3-AF52-4F66-BD01-C635FED8F87A)
and so on in a circle, 2000 log entries within one second at 1:51:46.
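To put a number on the pause/resume cycle per VM, a small sketch along these lines could be used, again assuming the worker messages are available as plain text strings. The helper name `count_pause_resume` is hypothetical; the pattern is taken from the wording of the three entries above.

```python
import re
from collections import Counter

# Matches e.g. "'name' was paused for critical error. (Virtual machine ID <GUID>)"
WORKER_RE = re.compile(
    r"'(?P<name>[^']+)' was (?P<action>paused for|resumed from) critical error\. "
    r"\(Virtual machine ID (?P<vmid>[0-9A-Fa-f-]+)\)"
)


def count_pause_resume(messages):
    """Return a Counter keyed by (vm_id, 'paused' | 'resumed')."""
    counts = Counter()
    for msg in messages:
        m = WORKER_RE.search(msg)
        if m:
            action = "paused" if m["action"].startswith("paused") else "resumed"
            counts[(m["vmid"].upper(), action)] += 1
    return counts
```

Comparing the paused and resumed totals per VM ID would show whether a VM ended the burst in a paused or a resumed state.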
Can you please give me some ideas, directions, anything? The problem hits Windows and Linux machines at random, on random nodes.
I will provide any additional info you ask for; I simply have nothing else to give from the logs.
Pero