White Paper for Built-in High Availability Protocol for HPC Pack 2019
With the release of HPC Pack 2019, we've added a new HA model that relies on a SQL Always-On instance. Compared to the previous Service Fabric based implementation:
- HPC Pack has the same HA capability when one of the head nodes fails.
- Only 2 nodes are required for HA.
- A thin HA library layer is shipped with HPC Pack; it relies on SQL Always-On instead of Service Fabric.
Characteristic \ HA model | No HA | Service Fabric | SQL-based |
---|---|---|---|
Underlying mechanism | None | Service Fabric | SQL Always-On |
Minimum nodes needed | 1 | 3 | 2 |
Failover when the current primary node fails | No | Yes | Yes |
Keeps operating when the SQL server fails | No | No | No |
We can illustrate the architecture using HPC Scheduler as an example. The architecture itself is simple: HPC Scheduler uses one of the provided client implementations (SQL by default) to communicate with one of the server implementations (also SQL by default).

Picture 1 - Architecture of HA HPC Scheduler Service
Most other services in HPC Pack use the built-in HA protocol the same way as the Scheduler service, except for stateless services. Stateless services run on all the head nodes, so they don't need to go through the leader election process; usually only the stateless service instance running on the same node as the Scheduler service is used.
I: interval for heartbeat (e.g. 1 sec)
T: heartbeat timeout (e.g. 5 secs), T > 2 * I
Heartbeat Table: a table in the external HA system that contains heartbeat entries
Heartbeat Entry: an entry in the format {uuid, utype, timestamp}
ha_time: the current date time of the external HA system
- All times are in UTC.
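The terms above can be summarized in a small data model. The following Python sketch is purely illustrative (the names HeartbeatEntry, ha_time, and the constants are assumptions, not taken from the HPC Pack source); the sketches further below build on it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative timing constants (see the configuration section for the real values).
HEARTBEAT_INTERVAL_SECS = 1   # I
HEARTBEAT_TIMEOUT_SECS = 5    # T; must satisfy T > 2 * I

@dataclass
class HeartbeatEntry:
    """One heartbeat entry: {uuid, utype, timestamp}."""
    uuid: str            # unique instance ID of the service instance
    utype: str           # service type the instance works as
    timestamp: datetime  # last heartbeat time (UTC)

def ha_time() -> datetime:
    """Stand-in for the external HA system's clock; all times are in UTC."""
    return datetime.now(timezone.utc)
```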
UpdateHeartBeat(uuid, utype):
- For each type, update the entry {old_uuid, utype, old_timestamp} in the heartbeat table with {uuid, utype, ha_time}.
- For each type, if uuid is not equal to old_uuid, then (ha_time - old_timestamp > T) must be satisfied.
- The update process uses optimistic concurrency control, i.e. if the heartbeat entry has been updated before another heartbeat reaches it, the later heartbeat is discarded.
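Continuing the sketch above, a minimal in-memory version of the UpdateHeartBeat semantics might look like this (the real implementation performs the conditional update as a SQL statement, so concurrent heartbeats are resolved by the database):

```python
# In-memory stand-in for the heartbeat table; the real table lives in SQL.
heartbeat_table: dict[str, HeartbeatEntry] = {}

def update_heartbeat(uuid: str, utype: str) -> bool:
    """Try to write {uuid, utype, ha_time} for the given type.

    Returns True if the heartbeat was recorded, False if it was discarded.
    """
    now = ha_time()
    old = heartbeat_table.get(utype)
    # A different instance may only replace the entry after it has expired,
    # i.e. ha_time - old_timestamp > T.
    if old is not None and old.uuid != uuid:
        if (now - old.timestamp).total_seconds() <= HEARTBEAT_TIMEOUT_SECS:
            return False
    # In SQL this write is a conditional UPDATE (optimistic concurrency): if another
    # heartbeat already replaced the row read above, this later heartbeat is discarded.
    heartbeat_table[utype] = HeartbeatEntry(uuid, utype, now)
    return True
```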
GetPrimary(utype):
- Return the (uuid, utype) of the heartbeat entry with the queried utype if (ha_time - timestamp <= T). Otherwise return an empty value.
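GetPrimary then simply checks whether the entry for the queried type is still within the timeout. Again a sketch on top of the same illustrative in-memory table:

```python
from typing import Optional, Tuple

def get_primary(utype: str) -> Optional[Tuple[str, str]]:
    """Return (uuid, utype) of the current primary, or None if there is no live entry."""
    entry = heartbeat_table.get(utype)
    if entry is None:
        return None
    # The entry is only considered alive if it was refreshed within the last T seconds.
    if (ha_time() - entry.timestamp).total_seconds() <= HEARTBEAT_TIMEOUT_SECS:
        return (entry.uuid, entry.utype)
    return None
```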
The leader election process for a client S of a given utype works as follows:
1. After a client S starts, it generates a unique instance ID uuid to identify itself and marks itself with the exact utype it will work as in the future.
2. S calls GetPrimary(utype) every I secs.
3. If GetPrimary(utype) returns an empty value, S calls UpdateHeartBeat(uuid, utype).
4. S continues to call GetPrimary(utype) every I secs.
   a. If a subsequent call to GetPrimary(utype) returns the (uuid, utype) generated in step 1, S will then work as primary.
   b. If a subsequent call to GetPrimary(utype) returns a unique ID different from uuid with the same utype generated in step 1, go back to step 2.
   c. If a subsequent call to GetPrimary(utype) returns an empty value or a corrupted message, an error occurred in step 3. Retry step 3.
5. S calls UpdateHeartBeat(uuid, utype) and GetPrimary(utype) every I sec.
   a. If GetPrimary(utype) returns anything except (uuid, utype), or does not return for (T - I) secs, S exits and restarts.
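Putting the steps together, the election loop of a single instance could be sketched as follows. This is a simplification built on the get_primary / update_heartbeat sketches above; among other things, it does not model the (T - I) secs no-response case in step 5a, and the actual open-source library differs in detail.

```python
import time
import uuid as uuid_lib

def run_instance(utype: str) -> None:
    """Simplified leader-election loop for one service instance of the given type."""
    my_uuid = str(uuid_lib.uuid4())             # step 1: unique instance ID

    # Steps 2-4: poll until this instance becomes primary.
    while True:
        primary = get_primary(utype)            # step 2/4: poll every I secs
        if primary is None:
            update_heartbeat(my_uuid, utype)    # step 3: try to claim the slot
        elif primary == (my_uuid, utype):
            break                               # step 4a: we are primary now
        # step 4b/4c: another instance is primary or the claim failed; keep polling
        time.sleep(HEARTBEAT_INTERVAL_SECS)

    # Step 5: as primary, keep renewing the heartbeat and watching the table.
    while True:
        update_heartbeat(my_uuid, utype)
        if get_primary(utype) != (my_uuid, utype):
            raise SystemExit("lost primary role; restart the service")   # step 5a
        time.sleep(HEARTBEAT_INTERVAL_SECS)
```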
The HA implementation is open-sourced on GitHub.
Please refer to the deployment documentation of HPC Pack 2019.
There are two configuration settings related to the built-in HA protocol.
ServiceAffinity (default: 1)
- ServiceAffinity controls whether other HPC services should run on the same node as HPC Scheduler. The default value is 1, which means affinity is on; this is also the recommended value.
- To check the setting, run the PowerShell command Get-HpcClusterRegistry and find ServiceAffinity in the output.
- To change the setting, run the PowerShell command Set-HpcClusterRegistry -PropertyName ServiceAffinity -PropertyValue <new-value>
- After the affinity setting is changed, all head nodes need to be restarted.
HeartBeatTimeOut (ms, default: 10000)
- This is the heartbeat timeout used by the algorithm above.
- This setting also affects the heartbeat interval, which is set to floor(HeartBeatTimeOut / 5). So this value can never be set to less than 5; otherwise the SQL server will be flooded with heartbeat messages.
- The only way to check and change this value is via the SQL table [HPCHAWitness].[dbo].[ParameterTable].
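As a quick worked example of the interval rule (the variable names below are just for illustration), the default timeout yields a 2-second heartbeat interval:

```python
heartbeat_timeout_ms = 10000                       # HeartBeatTimeOut default (ms)
heartbeat_interval_ms = heartbeat_timeout_ms // 5  # floor(HeartBeatTimeOut / 5) = 2000 ms
```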
Note: It is NOT recommended to change the value of HeartBeatTimeOut. The only exception is during a degradation of network performance of the SQL Always-On servers. When changing this value, make sure all HPC services are stopped.
In Cluster Manager, remove the head node that needs to be replaced. Then install a new head node following the normal HPC HA head node installation process. For more information, refer to the HPC Pack 2019 deployment documentation.
The diagnostic process is the same as for other HPC clusters. If a service failover is abnormal, check the service log of that service.
Check here for how to collect service logs and for the tools available for reading binary logs.
- In HA clusters, head node servers need to be robust enough to run all the timer instructions in time. This means that on the head nodes, CPU load should stay below 90% at all times.
- In HA clusters, we recommend removing the compute node and broker node roles from all head nodes.
- HPC clusters, with or without HA, need the underlying SQL server working properly to function. If the SQL server is constantly under heavy load, we recommend upgrading the server instance; the same applies to the network between the head nodes and the SQL server.