ACM Pack Portal for Azure HPC Cluster

The HPC ACM (Azure Cluster Management) portal provides a user-friendly interface for Azure HPC cluster diagnostics, cluster run, cluster monitoring and management.

Cluster Diagnostics

Run different categories of diagnostic tests to generate reports for issue analysis.

Diagnostics table

The diagnostics table lists all diagnostic tests by job created time. You can customize the columns; by default, the list displays each job's id, created time, last changed time, diagnostic name, test category, test name, real-time state and progress. The table is a convenient way to track the real-time state and progress of all test jobs.

diagnostics list

How to use the portal to run diagnostic tests

  1. Select nodes in the resource table and click the 'Run Diagnostics' button.

    resource list

  2. Select a test, then enter the parameters and a test name.

    tests

  3. Wait for the test result.

    test result

Diagnostic result

Every diagnostic test result is divided into two parts: overview and tasks. The overview mainly gives the diagnostic report; it also includes job node information, error messages and job events. Tasks lists all tasks in the diagnostic job, and the task detail window shows each task's output.

The following uses the pingpong test as an example to walk through a diagnostic result.

Overview

Use the information in the overview to understand the aggregated result of the test and find hidden issues.

  1. Connectivity: Shows the results of running MPI Pingpong between node pairs, including latency, throughput and runtime for each pair.

    connectivity

  2. Latency: Latency information is available in two modes. Overview mode shows latency for all node pairs in the diagnostic job; node mode shows latency only for the pairs that include the selected node. Both modes show a histogram of the number of node pairs per latency range, a detailed latency report (Passed, Packet Size, Threshold, Average, Median, Standard Deviation and Variability), and pair lists for Best Pairs, Worst Pairs and Bad Pairs.

    Overview Mode

    overview mode latency

    Node Mode

    node mode latency

  3. Throughput: Throughput information is available in two modes. Overview mode shows throughput for all node pairs in the diagnostic job; node mode shows throughput only for the pairs that include the selected node. Both modes show a histogram of the number of node pairs per throughput range, a detailed throughput report (Passed, Packet Size, Threshold, Average, Median, Standard Deviation and Variability), and pair lists for Best Pairs, Worst Pairs and Bad Pairs.

    Overview Mode

    overview mode throughput

    Node Mode

    node mode throughput

  4. Node Diagnostic: Gives the reasons nodes failed, in two modes. Overview mode aggregates all failure reasons found in the diagnostic test; each reason may come with a solution and lists the nodes or node pairs involved. Node mode shows the failure reasons for the selected node, if any.

    Overview Mode

    overview node diagnostic

    Node Mode

    node mode diagnostic

  5. Nodes Groups: Shows the groups of nodes that can connect with each other and return a pingpong test result.

    nodes groups

  6. Nodes: Displays all nodes that ran the diagnostic test. In the pingpong test, a node that is not connected with any other node is considered a bad node and is shown in red on the Nodes tab.

    nodes

  7. Events: Shows events raised while the job was running, if any.

    events

  8. Error: If the job's aggregated result is not generated, the error information is returned instead.

    error
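As a rough illustration of the aggregation behind the Latency, Nodes Groups and Nodes tabs, the sketch below summarizes per-pair latency and groups nodes by connectivity. It is not the portal's implementation: the tuple layout, the sample values, the function names and the rule that a bad pair is one whose latency exceeds the threshold are all assumptions made for this example. The portal's definitions of Passed and Variability are not given here, so they are left out.

```python
from statistics import mean, median, pstdev

# Hypothetical input: one record per node pair that completed the pingpong test.
# (node a, node b, latency in microseconds); layout and values are illustrative only.
PAIRS = [
    ("node1", "node2", 1.8),
    ("node1", "node3", 2.1),
    ("node2", "node3", 9.5),
    ("node4", "node5", 2.0),
]
ALL_NODES = ["node1", "node2", "node3", "node4", "node5", "node6"]


def summarize_latency(pairs, threshold):
    """Aggregate per-pair latency into report-style figures."""
    values = [latency for _, _, latency in pairs]
    ranked = sorted(pairs, key=lambda p: p[2])
    return {
        "average": mean(values),
        "median": median(values),
        "std_dev": pstdev(values),
        "best_pair": ranked[0],    # lowest latency
        "worst_pair": ranked[-1],  # highest latency
        # Assumption: a "bad" pair is one whose latency exceeds the threshold.
        "bad_pairs": [p for p in pairs if p[2] > threshold],
    }


def node_groups(pairs, all_nodes):
    """Return (groups of mutually reachable nodes, nodes with no connection)."""
    neighbours = {node: set() for node in all_nodes}
    for a, b, _ in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    groups, seen = [], set()
    for start in all_nodes:
        if start in seen or not neighbours[start]:
            continue
        # Depth-first walk collects every node reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))

    # A node with no connected peer is treated as a bad node.
    bad_nodes = [node for node in all_nodes if not neighbours[node]]
    return groups, bad_nodes
```

With the sample data, `summarize_latency(PAIRS, threshold=5.0)` flags `("node2", "node3", 9.5)` as the only bad pair, and `node_groups(PAIRS, ALL_NODES)` returns two groups (`node1` to `node3`, and `node4` with `node5`) plus `node6` as a bad node.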

Tasks

The task table lists all tasks in a diagnostic job, with columns for task id, allocated nodes, state, remark and detail.

tasks

Click the detail button in the task table to open the task detail window, which shows the task output.

task output

Cluster Run

Run commands on cluster nodes and get the nodes' real-time responses.

Cluster run table

The cluster run table is sorted by job created time. You can customize the columns; by default, the table displays each job's id, created time, last changed time, command content, state and progress. Here you can view the real-time state and progress of cluster run jobs.

cluster run list

How to use the portal to create a cluster run job

  1. Select nodes in the resource table and click the 'Run Command' button.

    resource list

  2. Enter the command text in the popup window, then select the command type and the result view mode.

    Single Line Command

    new cluster run

    Script Block for Linux

    new cluster run

  3. Wait for the command result. In the multiple commands view, you can run new commands on the same nodes selected before.

    Single Command View

    clusrun result

    Multiple Commands View

    clusrun result

Cluster run result

The cluster run result is shown by node. Click a node in the nodes table to show that node's command output in the console on the right, where you can also download the command's full output. If an error occurs while getting the result from a node, the node's name is highlighted in red in the nodes table; click the name to open a popup window with the detailed error message.

clusrun result

clusrun result

clusrun error

Azure HPC Cluster Monitoring

View the properties of cluster nodes and jobs, along with high-level statistics, to monitor the cluster's condition.

Dashboard

The dashboard shows cluster nodes and jobs at a glance. Nodes have three states: OK, WARNING and ERROR. Jobs are divided into diagnostics and clusrun, and each type has a histogram of jobs by state. The job states are Queued, Running, Finishing, Finished, Canceling, Canceled and Failed.

dashboard
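The per-state job histograms on the dashboard amount to counting jobs by type and state. The sketch below shows one way to do that counting; the `(job type, state)` tuple layout, the sample jobs and the `state_histogram` name are hypothetical and do not reflect the portal's data model.

```python
from collections import Counter

# The job states listed for the dashboard, in display order.
JOB_STATES = ["Queued", "Running", "Finishing", "Finished",
              "Canceling", "Canceled", "Failed"]


def state_histogram(jobs):
    """Count jobs per state for each job type, keeping empty states as 0."""
    counts = {}
    for job_type, state in jobs:
        counts.setdefault(job_type, Counter())[state] += 1
    # A Counter returns 0 for states that never occurred, so every bar is present.
    return {job_type: [(state, counter[state]) for state in JOB_STATES]
            for job_type, counter in counts.items()}


# Hypothetical sample jobs as (job type, state) tuples.
SAMPLE_JOBS = [
    ("diagnostics", "Finished"),
    ("diagnostics", "Failed"),
    ("clusrun", "Running"),
    ("diagnostics", "Finished"),
]
```

Calling `state_histogram(SAMPLE_JOBS)` yields one list of seven `(state, count)` entries for `diagnostics` and one for `clusrun`, ready to be drawn as bars.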

Resource

Resource Table

The resource table lets you view cluster nodes and check their condition quickly. It shows each node's name, state, OS, active job count and memory; you can also customize the table columns.

resource table

Node Information

Node information includes basic information, node events and active jobs.

  1. Basic Info: Display CPU utilization, node metadata, network information and node registration information.

    node basic information

  2. Events: Display Azure scheduled events.

    events

  3. Active Jobs: Display the active jobs running on this node, including diagnostics and clusrun jobs.

    active jobs

Heatmap

The heatmap gives an overview of real-time resource utilization in the cluster; currently it supports only the CPU metric. The darker a node's square, the more of the resource the node is using. Hover over a node square to see the node name and CPU usage percentage in a tooltip, and click a node square to go to that node's detail page.

heatmap
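To make the "darker means busier" rule concrete, here is a minimal sketch that maps a CPU usage percentage to a color and builds the tooltip text. The color range, the linear scale and both function names are assumptions chosen for illustration; the portal's actual palette is not documented here.

```python
def cpu_to_shade(cpu_percent):
    """Map CPU usage (0-100) to a hex color; higher usage gives a darker shade."""
    usage = max(0.0, min(100.0, float(cpu_percent)))  # clamp out-of-range values
    # Red and green channels fall linearly from 230 (idle) to 40 (fully busy).
    level = round(230 - usage * (230 - 40) / 100)
    return "#{0:02x}{0:02x}ff".format(level)


def tooltip(node_name, cpu_percent):
    """Text shown when hovering over a node square."""
    return "{}: {:.0f}% CPU".format(node_name, cpu_percent)
```

For example, `cpu_to_shade(0)` returns `#e6e6ff` (light) and `cpu_to_shade(100)` returns `#2828ff` (dark), so a fully loaded node stands out against idle ones.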