White Paper for Built-in High Availability Protocol for HPC Pack 2019

Overview

With the release of HPC Pack 2019, we've added a new HA model that relies on a SQL Server Always On instance. Compared to the previous Service Fabric based implementation:

  • HPC Pack retains the same HA capability when one of the head nodes fails.
  • Only 2 head nodes are required for HA.
  • A thin HA library is shipped with HPC Pack, which relies on SQL Always On instead of Service Fabric.

HA model comparison

Characteristic \ HA model                  No HA   Service Fabric   Based on SQL
Underlying mechanism                       None    Service Fabric   SQL Always On
Minimum nodes needed                       1       3                2
Failover when current primary node fails   No      Yes              Yes
Operating when SQL server fails            No      No               No

Architecture

We can illustrate the architecture using the HPC Scheduler as an example. The architecture itself is simple: the HPC Scheduler uses one of the provided client implementations (SQL by default) to communicate with one of the server implementations (also SQL by default).


Picture 1 - Architecture of HA HPC Scheduler Service

Most other services in HPC Pack use the built-in HA protocol the same way the Scheduler service does, except stateless services. Stateless services run on all the head nodes, so they do not need to go through the leader election process; usually, only the stateless service instance running on the same node as the Scheduler service is used.

Protocol Detail

Design

Parameters

  • I: interval for heartbeat (e.g. 1 sec)
  • T: heartbeat timeout (e.g. 5 secs)
  • T > 2 * I

Data

  • Heartbeat Table: a table in the external HA system that contains heartbeat entries.
  • Heartbeat Entry: an entry in the format {uuid, utype, timestamp}
  • ha_time: the current date and time of the external HA system
  • All times are in UTC

Procedures

  • UpdateHeartBeat(uuid, utype):

    For each utype, update the entry {old_uuid, utype, old_timestamp} in the heartbeat table with {uuid, utype, ha_time}.

    For each utype, if uuid is not equal to old_uuid, then (ha_time – old_timestamp > T) must be satisfied.

    The update uses optimistic concurrency control, i.e. if the heartbeat entry has already been updated before another heartbeat arrives, the later heartbeat is discarded.

  • GetPrimary(utype):

    Return (uuid, utype) from the heartbeat entry matching the queried utype if (ha_time - timestamp <= T); otherwise return an empty value.
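The two procedures can be sketched with a minimal in-memory model, shown below for illustration only. The class name HeartbeatWitness and the use of time.time() for the clock are assumptions; the real implementation stores the table in the SQL Always On database and uses the SQL server's clock (ha_time).

```python
import time

T = 5.0  # heartbeat timeout in seconds (the T parameter above)

class HeartbeatWitness:
    """Hypothetical in-memory stand-in for the Heartbeat Table."""

    def __init__(self):
        # utype -> {uuid, utype, timestamp}
        self.table = {}

    def ha_time(self):
        # Stands in for the external HA system's clock (UTC).
        return time.time()

    def update_heartbeat(self, uuid, utype):
        """UpdateHeartBeat: write {uuid, utype, ha_time}, rejecting the
        write if a different, still-live instance holds the entry."""
        now = self.ha_time()
        entry = self.table.get(utype)
        if entry is not None and entry["uuid"] != uuid:
            # A different uuid may only take over after the old one timed out,
            # i.e. (ha_time - old_timestamp > T) must be satisfied.
            if now - entry["timestamp"] <= T:
                return False  # discarded: the current primary is still alive
        self.table[utype] = {"uuid": uuid, "utype": utype, "timestamp": now}
        return True

    def get_primary(self, utype):
        """GetPrimary: return (uuid, utype) if the heartbeat is fresh,
        otherwise an empty value (None)."""
        entry = self.table.get(utype)
        if entry is None or self.ha_time() - entry["timestamp"] > T:
            return None
        return (entry["uuid"], entry["utype"])
```

A real deployment would implement the same compare-and-update logic as a single atomic SQL statement, so that two concurrent heartbeats cannot both succeed.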

Algorithm

  1. After a client S starts, it generates a unique instance ID uuid to identify itself and marks itself with the exact utype it will work as.

  2. S calls GetPrimary(utype) every I secs.

  3. If GetPrimary(utype) returns an empty value, S calls UpdateHeartBeat(uuid, utype).

  4. Continue to call GetPrimary(utype) every I secs.

    a. If a subsequent call to GetPrimary(utype) returns the (uuid, utype) generated in step 1, S then works as primary and proceeds to step 5.

    b. If a subsequent call to GetPrimary(utype) returns a unique ID different from uuid with the same utype generated in step 1, go back to step 2.

    c. If a subsequent call to GetPrimary(utype) returns an empty value or a corrupted message, an error occurred in step 3. Retry step 3.

  5. S calls UpdateHeartBeat(uuid, utype) and GetPrimary(utype) every I secs.

    a. If GetPrimary(utype) returns anything except (uuid, utype), or does not return within (T - I) secs, S exits and restarts.
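The steps above can be sketched as a single election loop. This is a simplified sketch, not the product code: the function name run_instance is an assumption, and the witness parameter stands for any object exposing the two procedures from the previous section.

```python
import time
import uuid as uuidlib

I = 1.0  # heartbeat interval in seconds; the timeout T must satisfy T > 2 * I

def run_instance(witness, utype):
    """Leader-election loop for one service instance (steps 1-5 above).
    Returns when this instance loses primaryship; the real service exits
    and restarts at that point."""
    my_uuid = str(uuidlib.uuid4())              # step 1: unique instance ID
    while True:                                 # steps 2-4: try to become primary
        primary = witness.get_primary(utype)
        if primary is None:
            witness.update_heartbeat(my_uuid, utype)  # steps 3 / 4c
        elif primary == (my_uuid, utype):
            break                               # step 4a: we are now primary
        # step 4b (a different live primary) falls through and keeps polling
        time.sleep(I)
    while True:                                 # step 5: act as primary
        witness.update_heartbeat(my_uuid, utype)
        if witness.get_primary(utype) != (my_uuid, utype):
            return                              # step 5a: lost primaryship; restart
        time.sleep(I)
```

Note that in the real algorithm, step 4b returns to step 2 rather than looping in place; polling again after sleeping I secs is equivalent here because step 2 is itself a GetPrimary call.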

Implementation

The HA implementation is open-sourced at GitHub

Deploy

Please refer to deployment documentation of HPC Pack 2019.

Configuration

There are two configurations related to the built-in HA protocol.

  • ServiceAffinity (default to 1)

    • ServiceAffinity controls whether other HPC services should run on the same node as the HPC Scheduler. The default value is 1, which means affinity is on; this is also the recommended value.
    • To check the setting, run PowerShell command Get-HpcClusterRegistry and find ServiceAffinity in the output.
    • To change the setting, run PowerShell command Set-HpcClusterRegistry -PropertyName ServiceAffinity -PropertyValue <new-value>
    • After the affinity setting is changed, all head nodes need to be restarted.
  • HeartBeatTimeOut (ms, default to 10000)

    • This is the heartbeat timeout used by the algorithm above.
    • This setting also affects the heartbeat interval: the interval is set to floor(HeartBeatTimeOut / 5). This value should therefore never be set to less than 5, otherwise the SQL server will be flooded with heartbeat messages.
    • The only way to check and change this value is via SQL table [HPCHAWitness].[dbo].[ParameterTable].

    Note: It is NOT recommended to change the value of HeartBeatTimeOut. The only exception will be during a degradation of network performance of the SQL Always-on servers. When changing this value, make sure all HPC Services are stopped.
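To make the relationship concrete, the arithmetic above can be written out as follows; the function name heartbeat_interval is illustrative, not an HPC Pack API.

```python
import math

def heartbeat_interval(timeout_ms):
    """Heartbeat interval (ms) derived from HeartBeatTimeOut as
    described above: interval = floor(HeartBeatTimeOut / 5)."""
    if timeout_ms < 5:
        raise ValueError("HeartBeatTimeOut must be at least 5 ms")
    return math.floor(timeout_ms / 5)

# The default HeartBeatTimeOut of 10000 ms gives a 2000 ms interval,
# which satisfies the T > 2 * I constraint from the Design section.
```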

Replace One of the Head Nodes

Remove the head node that needs to be replaced in cluster manager. Then, install a new head node following the normal HPC HA head node installation process. For more information, refer to the HPC Pack 2019 deployment documentation.

Diagnostic Process and Tools

The diagnostic process is the same as for other HPC clusters. If a service failover is abnormal, check the service log of the affected service.

See the HPC Pack documentation for how to collect service logs and for the tools available for reading binary logs.

Recommendations

  • In HA clusters, head node servers need to be robust enough to run all timer routines on time. This means CPU load on a head node should stay below 90% at all times.
  • In HA clusters, we recommend removing the compute node and broker node roles from all head nodes.
  • HPC clusters, with or without HA, need the underlying SQL server to work properly in order to function. If the SQL server is constantly under heavy load, we recommend upgrading the server instance; the same applies to the network between the head nodes and the SQL server.