High Performance Computing (HPC) on Azure

Introduction to HPC

High Performance Computing (HPC), also called "Big Compute", uses a large number of CPU-based or GPU-based computers to solve complex mathematical tasks.

Many industries use HPC to solve some of their most difficult problems. These include workloads such as:

  • Genomics
  • Oil and gas simulations
  • Finance
  • Semiconductor design
  • Engineering
  • Weather modeling

How is HPC different on the cloud?

One of the primary differences between an on-premises HPC system and one in the cloud is that resources can be added and removed dynamically as they're needed. Dynamic scaling removes compute capacity as a bottleneck and instead allows customers to right-size their infrastructure for the requirements of their jobs.
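
The right-sizing idea can be sketched as a small calculation: derive a target node count from the current queue depth and clamp it to a configured range. This is only an illustration under stated assumptions. The function name `nodes_needed` and its parameters are hypothetical and are not part of any Azure API.

```python
import math


def nodes_needed(pending_tasks: int, tasks_per_node: int,
                 min_nodes: int = 0, max_nodes: int = 100) -> int:
    """Return a target node count for the current queue depth.

    All names here are hypothetical; this is not an Azure API.
    The result is clamped to the range [min_nodes, max_nodes].
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    # Round up so that a partially filled node is still provisioned.
    wanted = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, wanted))


if __name__ == "__main__":
    # An empty queue scales down to the floor; a deep queue hits the cap.
    print(nodes_needed(0, 4))
    print(nodes_needed(9, 4))
    print(nodes_needed(1000, 4, max_nodes=50))
```

Re-evaluating a rule like this on a schedule lets the cluster grow while jobs are queued and shrink when the queue drains, which is the behavior a fixed-size on-premises system can't offer.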

The following articles provide more detail about this dynamic scaling capability.

Implementation checklist

As you're looking to implement your own HPC solution on Azure, ensure you've reviewed the following topics:

  • Choose the appropriate architecture based on your requirements
  • Know which compute options are right for your workload
  • Identify the right storage solution that meets your needs
  • Decide how you're going to manage all your resources
  • Optimize your application for the cloud
  • Secure your infrastructure

Infrastructure

There are a number of infrastructure components necessary to build an HPC system. Compute, storage, and networking provide the underlying components, no matter how you choose to manage your HPC workloads.

Example HPC architectures

There are a number of different ways to design and implement your HPC architecture on Azure. HPC applications can scale to thousands of compute cores, extend on-premises clusters, or run as a 100% cloud-native solution.

The following scenarios outline a few of the common ways HPC solutions are built.

Compute

Azure offers a range of virtual machine sizes that are optimized for both CPU-intensive and GPU-intensive workloads.

CPU-based virtual machines

GPU-enabled virtual machines

N-series VMs feature NVIDIA GPUs designed for compute-intensive or graphics-intensive applications, including artificial intelligence (AI) learning and visualization.

Storage

Large-scale Batch and HPC workloads have demands for data storage and access that exceed the capabilities of traditional cloud file systems. There are a number of solutions to manage both the speed and capacity needs of HPC applications on Azure.

For more information comparing Lustre, GlusterFS, and BeeGFS on Azure, review the Parallel File Systems on Azure e-book and the Lustre on Azure blog.

Networking

H16r, H16mr, A8, and A9 VMs can connect to a high-throughput back-end RDMA network. This network can improve the performance of tightly coupled parallel applications running under Microsoft MPI or Intel MPI.

Management

Do-it-yourself

Building an HPC system from scratch on Azure offers a significant amount of flexibility, but is often very maintenance-intensive.

  1. Set up your own cluster environment in Azure virtual machines or virtual machine scale sets.
  2. Use Azure Resource Manager templates to deploy leading workload managers, infrastructure, and applications.
  3. Choose HPC and GPU VM sizes that include specialized hardware and network connections for MPI or GPU workloads.
  4. Add high-performance storage for I/O-intensive workloads.

Hybrid and cloud bursting

If you have an existing on-premises HPC system that you'd like to connect to Azure, there are a number of resources to help get you started.

First, review the Options for connecting an on-premises network to Azure article in the documentation. From there, you may want information on these connectivity options:

Once network connectivity is securely established, you can start using cloud compute resources on demand with the bursting capabilities of your existing workload manager.
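
As a rough illustration of what a bursting decision looks like, the sketch below places each job on the on-premises cluster while free cores remain and sends the rest to the cloud. It is a simplified greedy rule under stated assumptions, not how any particular workload manager implements bursting. The function name `split_jobs` and its inputs are hypothetical.

```python
from typing import List, Tuple


def split_jobs(core_requests: List[int],
               on_prem_free_cores: int) -> Tuple[List[int], List[int]]:
    """Split jobs between on-premises and cloud capacity.

    Hypothetical helper: jobs are considered in submission order, and a
    job runs locally only if it fits in the remaining free cores.
    Returns (local_jobs, burst_jobs), each a list of core requests.
    """
    local: List[int] = []
    burst: List[int] = []
    free = on_prem_free_cores
    for cores in core_requests:
        if cores <= free:
            local.append(cores)
            free -= cores
        else:
            # Not enough local capacity left, so this job bursts to the cloud.
            burst.append(cores)
    return local, burst


if __name__ == "__main__":
    local, burst = split_jobs([16, 32, 8, 64], on_prem_free_cores=48)
    print("on-premises:", local)
    print("cloud burst:", burst)
```

Real schedulers also weigh queue wait time, data locality, and cost before bursting, so treat this only as a picture of the basic overflow idea.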

Marketplace solutions

There are a number of workload managers offered in the Azure Marketplace.

Azure Batch

Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. Azure Batch schedules compute-intensive work to run on a managed pool of virtual machines, and can automatically scale compute resources to meet the needs of your jobs.

SaaS providers or developers can use the Batch SDKs and tools to integrate HPC applications or container workloads with Azure, stage data to Azure, and build job execution pipelines.

Azure CycleCloud

Azure CycleCloud provides the simplest way to manage HPC workloads on Azure using any scheduler (like Slurm, Grid Engine, HPC Pack, HTCondor, LSF, PBS Pro, or Symphony).

CycleCloud allows you to:

  • Deploy full clusters and other resources, including scheduler, compute VMs, storage, networking, and cache
  • Orchestrate job, data, and cloud workflows
  • Give admins full control over which users can run jobs, as well as where and at what cost
  • Customize and optimize clusters through advanced policy and governance features, including cost controls, Active Directory integration, monitoring, and reporting
  • Use your current job scheduler and applications without modification
  • Take advantage of built-in autoscaling and battle-tested reference architectures for a wide range of HPC workloads and industries

Workload managers

The following are examples of cluster and workload managers that can run in Azure infrastructure. Create stand-alone clusters in Azure VMs or burst to Azure VMs from an on-premises cluster.

Containers

Containers can also be used to manage some HPC workloads. Services like the Azure Kubernetes Service (AKS) make it simple to deploy a managed Kubernetes cluster in Azure.

Cost management

You can manage your HPC costs on Azure in a few different ways. Ensure you've reviewed the Azure purchasing options to find the method that works best for your organization.

Security

For an overview of security best practices on Azure, review the Azure security documentation.

In addition to the network configurations available in the hybrid and cloud bursting section, you may want to implement a hub/spoke configuration to isolate your compute resources:

HPC applications

Run custom or commercial HPC applications in Azure. Several examples in this section are benchmarked to scale efficiently with additional VMs or compute cores. Visit the Azure Marketplace for ready-to-deploy solutions.

Note

Check with the vendor of any commercial application for licensing or other restrictions for running in the cloud. Not all vendors offer pay-as-you-go licensing. You might need a license server in the cloud for your solution, or a connection to an on-premises license server.

Engineering applications

Graphics and rendering

AI and deep learning

MPI providers

Remote visualization

Performance benchmarks

Customer stories

There are a number of customers who have seen great success by using Azure for their HPC workloads. You can find a few of these customer case studies below:

Other important information

  • Ensure your vCPU quota has been increased before attempting to run large-scale workloads.
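
A quick arithmetic check can show whether a planned run fits within the quota before you submit it. The sketch below is a minimal illustration. The function name `vcpu_shortfall` and the example numbers are hypothetical, and the real limit and current usage for a subscription have to be looked up separately.

```python
def vcpu_shortfall(node_count: int, vcpus_per_node: int,
                   quota_limit: int, quota_in_use: int = 0) -> int:
    """Return how many additional vCPUs of quota a planned run needs.

    Hypothetical helper, not an Azure API. A result of 0 means the run
    fits within the remaining quota.
    """
    required = node_count * vcpus_per_node
    available = quota_limit - quota_in_use
    return max(0, required - available)


if __name__ == "__main__":
    # Illustrative numbers only: 10 nodes of 16 vCPUs against a limit of 100.
    missing = vcpu_shortfall(node_count=10, vcpus_per_node=16, quota_limit=100)
    if missing:
        print(f"Request a quota increase of at least {missing} vCPUs.")
    else:
        print("The planned run fits within the current quota.")
```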

Next steps

For the latest announcements, see:

Microsoft Batch examples

These tutorials provide details on running applications on Microsoft Batch.