HPC cluster deployed in the cloud

High-performance computing (HPC) applications can scale to thousands of compute cores, extend on-premises big compute, or run as a 100% cloud-native solution. This HPC solution is built on Azure Batch, a managed Azure service, and is initiated by an Azure Pipelines job. These services run in a highly available environment that is patched and supported for you, so you can focus on your solution instead of the environment it runs in.

Architecture

Architecture diagram. Download an SVG of this architecture.

The solution involves the following steps:

  1. Azure Pipelines starts a pipeline that compiles the team's code project and stores it as an executable in Azure Storage.
  2. The pipeline job continues by loading some processing data into the storage account.
  3. Finally, Azure Pipelines requests that the Azure Batch service initiate its processing job, completing the pipeline.
  4. The Azure Batch service copies the program executables and input data from storage, and assigns them to a pool of compute nodes.
  5. The Batch service performs job and task management for the pool, retrying or reassigning tasks as nodes complete their work.
  6. As the compute nodes work, Azure Monitor collects performance data (CPU, memory, and disk I/O) and log files from the pool. The team can study this telemetry to build better jobs in the future.
  7. When the compute nodes complete tasks, they output their program data back to Azure Storage for the team's review.
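The job and task management described in step 5 can be sketched as a small simulation. This is a minimal illustration in plain Python, not the Azure Batch API or its actual scheduling algorithm: the `Task` class, the `run_job` function, the `max_retries` budget, and the round-by-round assignment are all hypothetical names and simplifications introduced here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    task_id: str
    max_retries: int = 2  # hypothetical retry budget per task
    attempts: int = 0


def run_job(
    tasks: List[Task],
    node_count: int,
    execute: Callable[[Task, int], bool],
) -> Tuple[List[str], List[str]]:
    """Hand queued tasks to nodes round by round and requeue failures.

    `execute(task, node_index)` stands in for a compute node running the
    task; it returns True on success and False on failure.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    queue = deque(tasks)
    completed: List[str] = []
    failed: List[str] = []
    while queue:
        # In each round, every node takes at most one task from the queue.
        batch = [queue.popleft() for _ in range(min(node_count, len(queue)))]
        for node_index, task in enumerate(batch):
            task.attempts += 1
            if execute(task, node_index):
                completed.append(task.task_id)
            elif task.attempts <= task.max_retries:
                queue.append(task)  # retry later, possibly on another node
            else:
                failed.append(task.task_id)
    return completed, failed


if __name__ == "__main__":
    # Hypothetical workload: task "t2" fails on its first attempt only.
    def flaky(task: Task, node_index: int) -> bool:
        return not (task.task_id == "t2" and task.attempts == 1)

    done, lost = run_job([Task("t1"), Task("t2"), Task("t3")], 2, flaky)
    print("completed:", done, "failed:", lost)
```

In the real service, this bookkeeping is handled by Batch itself; the sketch only shows why a failed task does not have to fail the whole job.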

Components

  • Azure Pipelines builds and tests code projects, and initiates the HPC jobs on the Azure Batch service.
  • Azure Storage houses HPC data and executable files used in a job.
  • Azure Batch schedules the jobs and tasks across a massive number of nodes, and manages all of the compute resources.
  • Azure Virtual Machines run as workers, performing the compute tasks.
  • Virtual Network provides IP connectivity between the compute resources and the other cloud services, in addition to any native InfiniBand or RDMA communication.
  • Azure Monitor collects performance metrics and logs from the cloud resources for reports, alerting, and automated response.

Considerations

Batch compute pools aren't limited to commodity hardware. Azure Batch can use specialized virtual machines, including GPU-optimized virtual machines with NVIDIA Tesla GPUs and high-throughput InfiniBand networking.

Batch compute pools can autoscale, growing and shrinking the number of nodes in the pool as the amount of work changes. Instead of paying for idle pool members, you pay only for the nodes that are performing tasks, which can reduce the compute cost of a job.
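The scaling decision can be illustrated with a short sketch. In Azure Batch, autoscaling is driven by an autoscale formula that the service evaluates against pool metrics; the Python below is not that formula language, only an assumed rule of thumb (enough nodes to cover the pending tasks, clamped between a minimum and a maximum). The function name and all parameters are hypothetical.

```python
import math


def target_node_count(
    pending_tasks: int,
    tasks_per_node: int,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Return how many nodes a pool should aim for under a simple rule.

    Assumption: each node can work on `tasks_per_node` tasks at once, and
    the pool should hold just enough nodes for the pending work, never
    fewer than `min_nodes` and never more than `max_nodes`.
    """
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # Hypothetical queue depths observed over time.
    for pending in (0, 10, 1000):
        print(pending, "pending ->", target_node_count(pending, 4, 0, 25), "nodes")
```

With a minimum of zero, an idle pool shrinks to no nodes at all, which is where the cost saving comes from; the maximum caps spending during bursts.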

Pricing

To explore the cost of running this scenario, use the Azure pricing calculator, which preconfigures all Azure services.

Azure Batch is a free service; customers pay only for the underlying virtual machine, storage, and networking costs. In this solution, there are additional costs for the Azure Pipelines and Azure Monitor services. Azure Batch also offers an option to buy graphics rendering software (such as Autodesk Maya and Chaos Group V-Ray) at a per-minute rate. See Azure Batch Pricing for details.
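The cost structure above can be written out as simple arithmetic. The sketch below assumes the breakdown given in the text (no charge for Batch itself, plus virtual machine, storage, other service, and optional per-minute software charges). The function name, parameters, and every rate in the example are hypothetical placeholders, not actual Azure prices; use the pricing calculator for real figures.

```python
def estimate_job_cost(
    node_hours: float,
    vm_rate_per_hour: float,
    storage_cost: float = 0.0,
    other_services_cost: float = 0.0,  # e.g. Azure Pipelines, Azure Monitor
    licensed_app_minutes: float = 0.0,
    app_rate_per_minute: float = 0.0,
) -> float:
    """Add up the cost components of one job (all inputs are assumptions)."""
    batch_service_fee = 0.0  # per the text, Batch itself is free
    compute_cost = node_hours * vm_rate_per_hour
    software_cost = licensed_app_minutes * app_rate_per_minute
    return (
        batch_service_fee
        + compute_cost
        + storage_cost
        + other_services_cost
        + software_cost
    )


if __name__ == "__main__":
    # Hypothetical job: 100 node-hours at a made-up rate of 0.50 per hour,
    # plus made-up storage, service, and rendering-software charges.
    total = estimate_job_cost(
        node_hours=100,
        vm_rate_per_hour=0.5,
        storage_cost=2.0,
        other_services_cost=3.0,
        licensed_app_minutes=60,
        app_rate_per_minute=0.25,
    )
    print("estimated total:", total)
```

Because compute is the product of node-hours and the hourly rate, anything that trims idle node-hours, such as autoscaling, lowers the largest term directly.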

Next steps