Windows HPC Server 2008 Management, Monitoring and Diagnostics
I uploaded two more videos on TechNet/Edge. They are worth watching if you are interested in an overview of the Windows HPC Admin Console. This integrated solution lets system administrators perform their common tasks from a single GUI.
Monitoring and managing a large-scale cluster often requires advanced tooling. System administrators need tools that help them manage heterogeneous compute nodes, check cluster status at a glance, identify deviations, correlate node and job information, track changes, and integrate with existing IT infrastructure. The Windows HPC Server 2008 Admin Console addresses all of these needs with an integrated solution.
The Admin Console includes five main areas: charts and reports, configuration, node management, job management, and diagnostics. In addition, the console has a “pivoting” feature that lets the system administrator navigate to different views while keeping the same context. Our program manager, Rae Wang, goes through each of the five areas with demonstrations and simple scenarios in the first video, Monitoring and Management.
For large computing clusters, diagnostics is where system administrators spend a lot of their time. Common tasks include:
- Validate the cluster after deployment or a configuration change.
- Troubleshoot failures.
- Measure performance degradation over time.
Windows HPC Server 2008 has 16 built-in diagnostic tests to help system administrators with these tasks. The tests fall into three categories: infrastructure, configuration report, and performance. Infrastructure tests cover the scheduler, system services, connectivity, and Service Oriented Architecture (the WCF broker model). Configuration report tests cover applications, the network, software updates, and system services. Finally, two MPIPingPong tests measure cluster performance in terms of latency and bandwidth.
The diagnostic tests are flexible and easy to run, and the results are filterable and searchable. System administrators can use the test results to diagnose problems further with built-in tools such as clusrun, Remote Desktop, and node templates.