HPC Pack Mesos integration step by step

Introducing the HPC Pack Mesos framework

The HPC Pack Mesos framework is a Mesos scheduler framework that accepts offers from the Mesos master and builds HPC Pack compute nodes for an existing HPC Pack cluster. With it, resource allocation for a Microsoft HPC Pack cluster can be managed by an existing Mesos cluster, which increases resource utilization. The HPC Pack Mesos framework:

  • Borrows HPC Pack compute nodes from the Mesos cluster if
    • HPC Pack has queued tasks that need more resources
    • The Mesos cluster has resources available for HPC Pack
  • Returns HPC Pack compute nodes to the Mesos cluster if
    • The node has reached the idle timeout of the Mesos framework
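
The borrow and return conditions above can be sketched as two small predicates. This is only an illustration under stated assumptions, not the framework's actual code: the function names and parameters are hypothetical, and the sketch assumes that both borrow conditions must hold at the same time.

```python
def should_borrow(queued_tasks, offered_cores):
    """Return True when a node should be borrowed from the Mesos cluster.

    Assumption: both conditions from the list above must hold together,
    i.e. HPC Pack has queued tasks AND Mesos has offered resources.
    """
    return queued_tasks > 0 and offered_cores > 0


def should_return(idle_seconds, idle_timeout_seconds):
    """Return True when a borrowed node has been idle for the full timeout."""
    return idle_seconds >= idle_timeout_seconds


# Hypothetical inputs, for illustration only.
print(should_borrow(queued_tasks=3, offered_cores=8))              # True
print(should_borrow(queued_tasks=0, offered_cores=8))              # False
print(should_return(idle_seconds=200, idle_timeout_seconds=180))   # True
```

Keeping the two decisions separate mirrors the list above: borrowing depends on cluster-wide demand and supply, while returning depends only on how long an individual node has been idle.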

Prerequisites

  1. A Mesos cluster that is built and properly configured. See Mesos Getting Started.
  2. HPC Pack 2016 Update 2 or later, installed on a domain-joined HPC Head Node.
  3. Domain-joined Windows Server nodes in the Mesos cluster, which will later act as HPC Pack Compute Nodes ("Mesos Compute Nodes").
  4. Python 2.7 and pipenv on the node that will host the HPC Pack Mesos framework ("Mesos Framework Node").

Set up HPC Pack Mesos framework

On the Mesos Framework Node, clone the repository and install all dependencies:
git clone https://github.com/Azure/hpcpack-mesos.git
cd hpcpack-mesos
pipenv install
  1. Copy daemon.ps1 and setupscript.ps1 from the hpcpack-mesos folder into a local folder on the Mesos Compute Nodes (recommended), or into a shared folder that the Mesos Compute Nodes can access.

    Note: If these two scripts are placed in a shared folder, you may have to change the PowerShell execution policy to Unrestricted. See About Execution Policies.

  2. In an elevated command prompt, start mesos-agent.exe on the Mesos Compute Nodes with the following parameters:

    --attributes=os:windows_server;cores:<core_number>[;node_group:<node_group_name>] --hostname=<hostname>
    

    Parameter details:

    | Name | Description | Example |
    | --- | --- | --- |
    | core_number | Maximum number of cores on the current node | 8 |
    | node_group_name | (Optional) The node group into which the current HPC Pack Mesos framework instance should try to grow nodes | NodeGroup1 |
    | hostname | Hostname of the current node | IaaSCN000 |

    Complete example for growing into the default node group:

    --attributes=os:windows_server;cores:8 --hostname=IaaSCN000
    

    Complete example for growing into node group NodeGroup1:

    --attributes=os:windows_server;cores:8;node_group:NodeGroup1 --hostname=IaaSCN000
    
  3. From your HPC cluster administrator, get the .pfx certificate file that was selected when installing the HPC Pack head nodes and compute nodes. Use OpenSSL to convert the .pfx file to a .pem file, which will be used later as the client certificate:

    openssl pkcs12 -in file.pfx -out file.pem -nodes
    
  4. Start the HPC Pack Mesos framework on the Mesos Framework Node:

pipenv run python hpcframework.py [-h] [-g NODE_GROUP] script_path setup_path headnode ssl_thumbprint client_cert

Positional parameter details:

| Name | Description |
| --- | --- |
| script_path | Path of the HPC Pack Mesos slave setup script |
| setup_path | Path of the HPC Pack setup executable |
| headnode | Hostname of the HPC Pack cluster head node |
| ssl_thumbprint | Thumbprint of the certificate used for installation and for communication with the HPC Pack cluster |
| client_cert | .pem file of the client certificate used for HPC Management REST API authentication. Generated by the OpenSSL conversion step above. |

Optional parameter details:

| Name | Description |
| --- | --- |
| -h | Show script help information |
| -g NODE_GROUP, --node_group NODE_GROUP | The node group in which to perform grow-shrink |

Complete example:

pipenv run python hpcframework.py "setupscript.ps1" "setup.exe" "hpcheadnode" "0386B1198B956BBAAA4154153B6CA1F44B6D1016" "HPC2016Comm.pem"
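
The two command lines in this section, the mesos-agent.exe parameters and the framework start command, can also be assembled programmatically, which helps when many nodes need consistent settings. The following is a minimal sketch: the helper names `build_agent_args` and `build_framework_command` are hypothetical and not part of the hpcpack-mesos repository. The sketch only reproduces the argument formats documented above.

```python
def build_agent_args(cores, hostname, node_group=None):
    """Build the mesos-agent.exe parameters described above (hypothetical helper)."""
    attributes = ["os:windows_server", "cores:{}".format(cores)]
    if node_group:
        # node_group is optional; omit it to grow into the default node group.
        attributes.append("node_group:{}".format(node_group))
    return ["--attributes=" + ";".join(attributes), "--hostname=" + hostname]


def build_framework_command(script_path, setup_path, headnode, ssl_thumbprint,
                            client_cert, node_group=None):
    """Build the framework start command as an argument list (hypothetical helper)."""
    command = ["pipenv", "run", "python", "hpcframework.py"]
    if node_group:
        command += ["-g", node_group]
    # Positional parameters, in the order given in the usage line.
    command += [script_path, setup_path, headnode, ssl_thumbprint, client_cert]
    return command


print(" ".join(build_agent_args(8, "IaaSCN000", "NodeGroup1")))
# --attributes=os:windows_server;cores:8;node_group:NodeGroup1 --hostname=IaaSCN000
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.call` without any shell quoting of the semicolons in the attribute value.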

Validate HPC Pack Mesos framework

To validate that the HPC Pack Mesos framework is set up properly, follow these steps:

  1. Find the node group into which Mesos framework will grow compute nodes.
  2. Take all nodes in that node group offline.
  3. Submit a job that is required to run on the specified node group.
  4. If Mesos offers are available, HPC Pack compute nodes will join the cluster. The first time, this takes several minutes.
  5. Once the cluster has grown large enough for the job, the workload will start.
  6. After all jobs on a node have completed, the node will be returned to the Mesos cluster after being idle for 3 minutes.
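
When watching the last step, it can help to picture the idle bookkeeping as a simple timeline. The sketch below is an assumption about how such tracking could work, not the framework's actual implementation: the `IdleTracker` class and its methods are hypothetical, and only the 3-minute timeout comes from the steps above.

```python
IDLE_TIMEOUT_SECONDS = 3 * 60  # the 3-minute idle period from the last step


class IdleTracker(object):
    """Track when each node became idle (hypothetical illustration)."""

    def __init__(self, timeout_seconds=IDLE_TIMEOUT_SECONDS):
        self.timeout_seconds = timeout_seconds
        self.idle_since = {}  # node name -> time (seconds) the node became idle

    def report(self, node, running_jobs, now):
        """Record the node's state at time `now`."""
        if running_jobs > 0:
            # A busy node is not idle; forget any earlier idle start time.
            self.idle_since.pop(node, None)
        else:
            # Keep the earliest idle time so the idle period keeps accumulating.
            self.idle_since.setdefault(node, now)

    def nodes_to_return(self, now):
        """List nodes that have been idle for at least the timeout."""
        return sorted(node for node, since in self.idle_since.items()
                      if now - since >= self.timeout_seconds)


tracker = IdleTracker()
tracker.report("IaaSCN000", running_jobs=1, now=0)    # job still running
tracker.report("IaaSCN000", running_jobs=0, now=60)   # job finished at t=60 s
print(tracker.nodes_to_return(now=200))  # [] (idle for only 140 s)
print(tracker.nodes_to_return(now=240))  # ['IaaSCN000'] (idle for 180 s)
```

In this model a node that picks up a new job before the timeout expires restarts its idle period from zero, so a node is only expected back in the Mesos cluster after a full 3 minutes without work.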

Reference

HPC Pack Mesos framework Project Home