Cluster failover scenarios on your Azure Stack Edge Pro GPU device

This article describes the common failover scenarios, how the Azure Stack Edge device responds to each, and how a failover affects the workloads deployed on the cluster.

About failover

Azure Stack Edge can be set up as a single standalone device or a two-node cluster. In a two-node cluster, the clustered nodes provide high availability for applications and services that are running on the cluster.

If one of the clustered nodes fails, the other node begins to provide service. This process is known as failover. Failover may also occur when hardware components associated with one or both nodes of your device fail, such as disk drives, power supply units (PSUs), or network ports, or when you update your device nodes.

Failover scenarios

Failover may occur as a result of a hardware component failure, a node failure, or an update to the Azure Stack Edge cluster.

Hardware failures

These tables summarize the failure scenarios for the physical hardware components associated with your device cluster: disk drives, power supply units, and network ports.

Disk drive failures

| Node A | Node B | Cluster survives | Failover | Details |
|---|---|---|---|---|
| 1 disk drive fails | No failures | Yes | No | Cluster is degraded until the disk is replaced. |
| 2 or more disk drives fail | No failures | Yes | No | Cluster is degraded until the disks are replaced. |
| 1 or more disk drives fail | 1 or more disk drives fail | No | - | Cluster goes offline. |

Power supply unit failures

| Node A | Node B | Cluster survives | Failover | Details |
|---|---|---|---|---|
| 1 PSU fails | No failures | Yes | No | Another power supply failure on node A will result in failover to node B. |
| 1 PSU fails | 1 PSU fails | Yes | No | Another power supply failure on either node will result in failover. |
| 2 PSUs fail | No failures | Yes | Yes | VMs on node A fail over to node B. |
| 2 PSUs fail (TBC) | 1 PSU fails | Yes | Yes | VMs on node A fail over to node B. |
| 2 PSUs fail | 2 PSUs fail | No | - | Cluster goes offline. |
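
The power supply table above follows a simple pattern: a node stays up until both of its PSUs fail, failover happens when exactly one node is down, and the cluster goes offline when both are. A minimal Python sketch of that pattern follows. The names `Outcome` and `psu_outcome` are hypothetical, the two-PSUs-per-node count is inferred from the table, and the symmetric handling of node B losing both PSUs is an assumption, since the table lists only node A in that role.

```python
from typing import NamedTuple, Optional

PSUS_PER_NODE = 2  # assumption: inferred from the table, not a documented constant


class Outcome(NamedTuple):
    cluster_survives: bool
    failover: Optional[bool]  # None where the table gives no failover value


def psu_outcome(failed_a: int, failed_b: int) -> Outcome:
    """Classify a combination of failed PSUs on node A and node B."""
    for failed in (failed_a, failed_b):
        if not 0 <= failed <= PSUS_PER_NODE:
            raise ValueError("failed PSU count must be between 0 and 2")
    # A node is treated as down only when all of its PSUs have failed.
    a_down = failed_a == PSUS_PER_NODE
    b_down = failed_b == PSUS_PER_NODE
    if a_down and b_down:
        return Outcome(cluster_survives=False, failover=None)
    # Exactly one node down means its VMs fail over to the other node.
    return Outcome(cluster_survives=True, failover=a_down or b_down)
```

For example, `psu_outcome(1, 1)` reports a surviving cluster with no failover, while `psu_outcome(2, 1)` reports a surviving cluster with failover.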

Network failures

| Node A | Node B | Cluster survives | Failover | Details |
|---|---|---|---|---|
| Port 1, Port 2, Port 5, or Port 6 fails | No failures | Yes | No | Failed port is unavailable. Apps listening on this port are impacted. |
| Port 3, Port 4, or both fail | No failures | Yes | Yes | VMs on node A fail over to node B. |
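
The network table above splits ports into two groups: those whose failure affects only the apps listening on them, and those whose failure triggers a failover. The lookup below is a sketch of that split in Python. The function and constant names are hypothetical, and it covers only the case the table lists, where the other node has no failures. Treating a mixed failure (for example, Port 2 and Port 4 together) as a failover is an assumption, not something the table states.

```python
from typing import Iterable

# Port groupings copied from the table; the constant names are hypothetical.
NO_FAILOVER_PORTS = frozenset({1, 2, 5, 6})
FAILOVER_PORTS = frozenset({3, 4})


def port_failure_causes_failover(failed_ports: Iterable[int]) -> bool:
    """Return True if any failed port on a node belongs to the failover group."""
    failed = set(failed_ports)
    unknown = failed - NO_FAILOVER_PORTS - FAILOVER_PORTS
    if unknown:
        raise ValueError(f"unknown port numbers: {sorted(unknown)}")
    # Assumption: one failed port in the failover group is enough,
    # regardless of which other ports have also failed.
    return bool(failed & FAILOVER_PORTS)
```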

Node failures and updates

Node failure

This table summarizes the failure scenarios when an entire node has failed on your cluster.

| Node A | Node B | Cluster survives | Failover | Details |
|---|---|---|---|---|
| Entire node fails | No failures | Yes | Yes | VMs from node A fail over to node B. |
| Entire node fails | Entire node fails | No | - | Cluster goes offline. |
| Reboot | No failures | Yes | Yes | VMs from node A fail over to node B. |
| Reboot | Reboot | No | - | Cluster is offline until the reboot completes. |
| Core component fails (for example, motherboard, DIMM, or OS disk) | No failures | Yes | Yes | VMs from node A fail over to node B. |
| Core component fails (for example, motherboard, DIMM, or OS disk) | Core component fails (for example, motherboard, DIMM, or OS disk) | No | - | Cluster goes offline. |

Node update

| Node A | Node B | Cluster survives | Failover | Details |
|---|---|---|---|---|
| Node update | No failures | Yes | Yes | VMs from node A fail over to node B. |
| Node update | 2 PSUs fail | No | - | Cluster goes offline. |
| Node update | Entire node fails or goes offline | No | - | Cluster goes offline. |
| Node update | Reboot | No | - | Cluster goes offline. |
| Node update | Core component fails (for example, motherboard, DIMM, or OS disk) | No | - | Cluster goes offline. |
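
Taken together, the node failure and node update tables suggest one rule: the cluster survives as long as one node has no failures, and failover occurs when exactly one node is unavailable. The Python sketch below encodes that reading. `NodeState` and `cluster_outcome` are hypothetical names, and applying the rule to combinations the tables do not list (for example, a reboot on one node and a complete failure on the other) is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class NodeState(Enum):
    # State labels mirror the wording used in the tables.
    HEALTHY = "no failures"
    FAILED = "entire node fails"
    REBOOTING = "reboot"
    CORE_COMPONENT_FAILED = "core component fails"
    UPDATING = "node update"


def cluster_outcome(node_a: NodeState, node_b: NodeState) -> Tuple[bool, Optional[bool]]:
    """Return (cluster_survives, failover); failover is None when the cluster is offline."""
    unavailable = sum(state is not NodeState.HEALTHY for state in (node_a, node_b))
    if unavailable == 2:
        return (False, None)
    # With exactly one node unavailable, its VMs fail over to the other node.
    return (True, unavailable == 1)
```

For example, `cluster_outcome(NodeState.UPDATING, NodeState.REBOOTING)` reports that the cluster goes offline, matching the node update table.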

Next steps