Automatically scale Azure HDInsight clusters

Azure HDInsight's free Autoscale feature can automatically increase or decrease the number of worker nodes in your cluster based on criteria that you set in advance. During cluster creation, you set a minimum and maximum number of nodes and establish the scaling criteria using a day-time schedule or specific performance metrics, and the HDInsight platform does the rest.

How it works

The Autoscale feature uses two types of conditions to trigger scaling events: thresholds for various cluster performance metrics (called load-based scaling) and time-based triggers (called schedule-based scaling). Load-based scaling changes the number of nodes in your cluster, within a range that you set, to ensure optimal CPU usage and minimize running cost. Schedule-based scaling changes the number of nodes in your cluster based on operations that you associate with specific dates and times.

The following video provides an overview of the challenges that Autoscale solves and how it can help you control costs with HDInsight.

Choosing load-based or schedule-based scaling

Consider the following factors when choosing a scaling type:

  • Load variance: does the load of the cluster follow a consistent pattern at specific times, on specific days? If not, load-based scaling is a better option.
  • SLA requirements: Autoscale is reactive rather than predictive. Will there be a sufficient delay between when the load starts to increase and when the cluster needs to be at its target size? If there are strict SLA requirements and the load follows a fixed, known pattern, schedule-based scaling is a better option.

Cluster metrics

Autoscale continuously monitors the cluster and collects the following metrics:

Metric | Description
Total Pending CPU | The total number of cores required to start execution of all pending containers.
Total Pending Memory | The total memory (in MB) required to start execution of all pending containers.
Total Free CPU | The sum of all unused cores on the active worker nodes.
Total Free Memory | The sum of unused memory (in MB) on the active worker nodes.
Used Memory per Node | The load on a worker node. A worker node on which 10 GB of memory is used is considered under more load than a worker node with 2 GB of used memory.
Number of Application Masters per Node | The number of Application Master (AM) containers running on a worker node. A worker node that hosts two AM containers is considered more important than a worker node that hosts zero AM containers.

The above metrics are checked every 60 seconds. You can set up scaling operations for your cluster using any of these metrics.

Load-based scale conditions

When the following conditions are detected, Autoscale issues a scale request:

Scale-up | Scale-down
Total pending CPU is greater than total free CPU for more than 3 minutes. | Total pending CPU is less than total free CPU for more than 10 minutes.
Total pending memory is greater than total free memory for more than 3 minutes. | Total pending memory is less than total free memory for more than 10 minutes.

For scale-up, Autoscale issues a scale-up request to add the required number of nodes. The scale-up is based on how many new worker nodes are needed to meet the current CPU and memory requirements.

For scale-down, Autoscale issues a request to remove a certain number of nodes. The scale-down is based on the number of AM containers per node and the current CPU and memory requirements. The service also detects which nodes are candidates for removal based on current job execution. The scale-down operation first decommissions the nodes, and then removes them from the cluster.
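
The threshold rules in the table above can be sketched in Python. This is an illustration only, not HDInsight's implementation: the `Sample` structure and function names are hypothetical, and the source does not say how the CPU and memory rows combine. The sketch assumes that either row is enough to trigger a scale-up and that both rows must hold before a scale-down.

```python
from dataclasses import dataclass

SCALE_UP_AFTER_S = 3 * 60     # pending > free for more than 3 minutes
SCALE_DOWN_AFTER_S = 10 * 60  # pending < free for more than 10 minutes


@dataclass
class Sample:
    """One metrics reading (hypothetical structure, not an HDInsight type)."""
    t: int               # seconds since monitoring started
    pending_cpu: int     # Total Pending CPU (cores)
    free_cpu: int        # Total Free CPU (cores)
    pending_mem_mb: int  # Total Pending Memory (MB)
    free_mem_mb: int     # Total Free Memory (MB)


def held_for(samples, predicate):
    """Seconds for which `predicate` has held continuously up to the latest sample."""
    start = None
    for s in reversed(samples):
        if not predicate(s):
            break
        start = s.t
    return 0 if start is None else samples[-1].t - start


def decide(samples):
    """Return 'scale-up', 'scale-down', or 'none' for a time-ordered sample list."""
    # Assumption: either resource being short is enough to scale up.
    up = max(
        held_for(samples, lambda s: s.pending_cpu > s.free_cpu),
        held_for(samples, lambda s: s.pending_mem_mb > s.free_mem_mb),
    )
    if up > SCALE_UP_AFTER_S:
        return "scale-up"
    # Assumption: both resources must have spare capacity to scale down.
    down = min(
        held_for(samples, lambda s: s.pending_cpu < s.free_cpu),
        held_for(samples, lambda s: s.pending_mem_mb < s.free_mem_mb),
    )
    if down > SCALE_DOWN_AFTER_S:
        return "scale-down"
    return "none"
```

With readings taken every 60 seconds, five consecutive samples in which pending CPU exceeds free CPU span 240 seconds and so cross the 3-minute threshold. The sketch only decides the direction; how many nodes to add or remove depends on the CPU, memory, and AM-container considerations described above.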

Cluster compatibility

Important

The Azure HDInsight Autoscale feature was released for general availability on November 7, 2019 for Spark and Hadoop clusters and included improvements not available in the preview version of the feature. If you created a Spark cluster before November 7, 2019 and want to use the Autoscale feature on your cluster, the recommended path is to create a new cluster and enable Autoscale on the new cluster.

Autoscale for Interactive Query (LLAP) and HBase clusters is still in preview. Autoscale is only available on Spark, Hadoop, Interactive Query, and HBase clusters.

The following table describes the cluster types and versions that are compatible with the Autoscale feature.

Version | Spark | Hive | LLAP | HBase | Kafka | Storm | ML
HDInsight 3.6 without ESP | Yes | Yes | Yes | Yes* | No | No | No
HDInsight 4.0 without ESP | Yes | Yes | Yes | Yes* | No | No | No
HDInsight 3.6 with ESP | Yes | Yes | Yes | Yes* | No | No | No
HDInsight 4.0 with ESP | Yes | Yes | Yes | Yes* | No | No | No

* HBase clusters can only be configured for schedule-based scaling, not load-based.

Get started

Create a cluster with load-based Autoscaling

To enable the Autoscale feature with load-based scaling, complete the following steps as part of the normal cluster creation process:

  1. On the Configuration + pricing tab, select the Enable autoscale checkbox.

  2. Select Load-based under Autoscale type.

  3. Enter the intended values for the following properties:

    • Initial Number of nodes for Worker node.
    • Min number of worker nodes.
    • Max number of worker nodes.

The initial number of worker nodes must fall between the minimum and maximum, inclusive. This value defines the initial size of the cluster when it's created. The minimum number of worker nodes should be set to three or more. Scaling your cluster to fewer than three nodes can result in it getting stuck in safe mode because of insufficient file replication. For more information, see Getting stuck in safe mode.
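
The constraints on the three node counts can be checked before you submit the form or a template. The following Python helper is a minimal sketch; the function name and messages are hypothetical and are not part of any Azure SDK.

```python
def validate_load_based(initial, minimum, maximum):
    """Return a list of problems with a load-based Autoscale node configuration."""
    problems = []
    # Fewer than three worker nodes risks the cluster getting stuck in safe mode.
    if minimum < 3:
        problems.append("minimum should be 3 or more")
    if minimum > maximum:
        problems.append("minimum exceeds maximum")
    elif not (minimum <= initial <= maximum):
        problems.append("initial must be between minimum and maximum, inclusive")
    return problems


if __name__ == "__main__":
    # Values from the Resource Manager example: 4 initial, 3 minimum, 10 maximum.
    print(validate_load_based(4, 3, 10))
```

An empty list means the values satisfy the rules stated above.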

Create a cluster with schedule-based Autoscaling

To enable the Autoscale feature with schedule-based scaling, complete the following steps as part of the normal cluster creation process:

  1. On the Configuration + pricing tab, select the Enable autoscale checkbox.

  2. Enter the Number of nodes for Worker node, which controls the limit for scaling up the cluster.

  3. Select the option Schedule-based under Autoscale type.

  4. Select Configure to open the Autoscale configuration window.

  5. Select your time zone, and then select + Add condition.

  6. Select the days of the week that the new condition should apply to.

  7. Edit the time the condition should take effect and the number of nodes that the cluster should be scaled to.

  8. Add more conditions if needed.

The number of nodes must be between 3 and the maximum number of worker nodes that you entered before adding conditions.

Final creation steps

Select the VM type for worker nodes by selecting a VM from the drop-down list under Node size. After you choose the VM type for each node type, you can see the estimated cost range for the whole cluster. Adjust the VM types to fit your budget.

Your subscription has a capacity quota for each region. The total number of cores of your head nodes and the maximum number of worker nodes can't exceed the capacity quota. However, this quota is a soft limit; you can always create a support ticket to have it increased.

Note

If you exceed the total core quota limit, you'll receive an error message saying 'the maximum node exceeded the available cores in this region, please choose another region or contact the support to increase the quota.'

For more information on HDInsight cluster creation using the Azure portal, see Create Linux-based clusters in HDInsight using the Azure portal.

Create a cluster with a Resource Manager template

Load-based autoscaling

You can create an HDInsight cluster with load-based Autoscaling using an Azure Resource Manager template, by adding an autoscale node to the computeProfile > workernode section with the properties minInstanceCount and maxInstanceCount, as shown in the JSON snippet below.

{
  "name": "workernode",
  "targetInstanceCount": 4,
  "autoscale": {
      "capacity": {
          "minInstanceCount": 3,
          "maxInstanceCount": 10
      }
  },
  "hardwareProfile": {
      "vmSize": "Standard_D13_V2"
  },
  "osProfile": {
      "linuxOperatingSystemProfile": {
          "username": "[parameters('sshUserName')]",
          "password": "[parameters('sshPassword')]"
      }
  },
  "virtualNetworkProfile": null,
  "scriptActions": []
}

Schedule-based autoscaling

You can create an HDInsight cluster with schedule-based Autoscaling using an Azure Resource Manager template, by adding an autoscale node to the computeProfile > workernode section. The autoscale node contains a recurrence that has a timeZone and a schedule that describes when the change will take place.

{
  "autoscale": {
    "recurrence": {
      "timeZone": "Pacific Standard Time",
      "schedule": [
        {
          "days": [
            "Monday",
            "Tuesday",
            "Wednesday",
            "Thursday",
            "Friday"
          ],
          "timeAndCapacity": {
            "time": "11:00",
            "minInstanceCount": 10,
            "maxInstanceCount": 10
          }
        }
      ]
    }
  },
  "name": "workernode",
  "targetInstanceCount": 4
}
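
If you generate templates programmatically, the autoscale node above can be assembled from a few inputs. The following Python sketch mirrors the field names in the snippet; the helper names are hypothetical, and setting minInstanceCount and maxInstanceCount to the same value simply follows the example above. The second condition (weekend, 18:00, 3 nodes) is made up for illustration.

```python
import json


def schedule_condition(days, time, node_count):
    """Build one schedule entry that pins the cluster to `node_count` nodes."""
    return {
        "days": list(days),
        "timeAndCapacity": {
            "time": time,  # "HH:MM" in the recurrence time zone
            "minInstanceCount": node_count,
            "maxInstanceCount": node_count,
        },
    }


def autoscale_recurrence(time_zone, conditions):
    """Wrap schedule entries in the autoscale > recurrence structure."""
    return {
        "autoscale": {
            "recurrence": {"timeZone": time_zone, "schedule": list(conditions)}
        }
    }


if __name__ == "__main__":
    weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
    node = autoscale_recurrence(
        "Pacific Standard Time",
        [
            schedule_condition(weekdays, "11:00", 10),
            schedule_condition(["Saturday", "Sunday"], "18:00", 3),  # illustrative
        ],
    )
    print(json.dumps(node, indent=2))
```

The resulting dictionary can be merged into the workernode entry alongside name and targetInstanceCount.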

Enable and disable Autoscale for a running cluster

Using the Azure portal

To enable Autoscale on a running cluster, select Cluster size under Settings. Then select Enable autoscale. Select the type of Autoscale that you want and enter the options for load-based or schedule-based scaling. Finally, select Save.

Using the REST API

To enable or disable Autoscale on a running cluster using the REST API, make a POST request to the Autoscale endpoint:

https://management.azure.com/subscriptions/{subscription Id}/resourceGroups/{resourceGroup Name}/providers/Microsoft.HDInsight/clusters/{CLUSTERNAME}/roles/workernode/autoscale?api-version=2018-06-01-preview

Use the appropriate parameters in the request payload. The JSON payload below can be used to enable Autoscale. Use the payload {"autoscale": null} to disable Autoscale.

{ "autoscale": { "capacity": { "minInstanceCount": 3, "maxInstanceCount": 5 } } }

See the previous section on enabling load-based autoscale for a full description of all payload parameters.
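
The request described above can be assembled with Python's standard library. This is a sketch under stated assumptions: the function name is hypothetical, the bearer token is assumed to have been obtained separately for Azure Resource Manager, and the request is only built here, not sent.

```python
import json
import urllib.request

API_VERSION = "2018-06-01-preview"


def autoscale_request(subscription_id, resource_group, cluster_name, token, capacity=None):
    """Build a POST request that enables (capacity given) or disables (None) Autoscale.

    `capacity` is a (min_nodes, max_nodes) tuple for load-based scaling.
    """
    url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.HDInsight"
        f"/clusters/{cluster_name}/roles/workernode/autoscale"
        f"?api-version={API_VERSION}"
    )
    autoscale = None
    if capacity is not None:
        autoscale = {
            "capacity": {
                "minInstanceCount": capacity[0],
                "maxInstanceCount": capacity[1],
            }
        }
    # A None value serializes to null, which is the disable payload.
    body = json.dumps({"autoscale": autoscale}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed: token acquired elsewhere
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder identifiers; replace with real values before sending.
    req = autoscale_request("my-sub-id", "my-rg", "my-cluster", "TOKEN", capacity=(3, 5))
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(req)
```

Passing capacity=None produces the disable payload, so the same helper covers both operations.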

Monitoring Autoscale activities

Cluster status

The cluster status listed in the Azure portal can help you monitor Autoscale activities.

All of the cluster status messages that you might see are explained in the list below.

Cluster status | Description
Running | The cluster is operating normally. All of the previous Autoscale activities have completed successfully.
Updating | The cluster Autoscale configuration is being updated.
HDInsight configuration | A cluster scale-up or scale-down operation is in progress.
Updating Error | HDInsight encountered issues during the Autoscale configuration update. You can choose to either retry the update or disable Autoscale.
Error | Something is wrong with the cluster, and it isn't usable. Delete this cluster and create a new one.

To view the current number of nodes in your cluster, go to the Cluster size chart on the Overview page for your cluster, or select Cluster size under Settings.

Operation history

You can view the cluster scale-up and scale-down history as part of the cluster metrics. You can also list all scaling actions over the past day, week, or other period of time.

Select Metrics under Monitoring. Then select Add metric and Number of Active Workers from the Metric dropdown box. Select the button in the upper right to change the time range.

Other considerations

Consider the latency of scale-up or scale-down operations

It can take 10 to 20 minutes for a scaling operation to complete. When setting up a customized schedule, plan for this delay. For example, if you need the cluster size to be 20 at 9:00 AM, set the schedule trigger to an earlier time such as 8:30 AM so that the scaling operation has completed by 9:00 AM.
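
The lead-time arithmetic can be expressed as a small Python helper. This is an illustrative sketch; the function name is hypothetical, and the 30-minute default simply mirrors the 8:30 AM example above, leaving a margin beyond the stated 10 to 20 minutes.

```python
from datetime import datetime, timedelta


def trigger_time(needed_at, lead_minutes=30):
    """Return the "HH:MM" time to schedule a condition so scaling finishes by `needed_at`."""
    t = datetime.strptime(needed_at, "%H:%M") - timedelta(minutes=lead_minutes)
    return t.strftime("%H:%M")


if __name__ == "__main__":
    # Target size needed at 9:00 AM, so the condition is scheduled 30 minutes earlier.
    print(trigger_time("09:00"))
```

If the computed time wraps past midnight, the condition also has to move to the previous day of the week; the helper does not handle that adjustment.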

Preparation for scaling down

During the scale-down process, Autoscale decommissions nodes to meet the target size. If tasks are running on those nodes, Autoscale waits until the tasks are completed. Because each worker node also serves a role in HDFS, the temporary data is shifted to the remaining nodes. Make sure there's enough space on the remaining nodes to host all the temporary data.

Running jobs continue. Pending jobs wait to be scheduled on the smaller number of available worker nodes.

Minimum cluster size

Don't scale your cluster down to fewer than three nodes. Scaling your cluster to fewer than three nodes can result in it getting stuck in safe mode because of insufficient file replication. For more information, see Getting stuck in safe mode.

Next steps

Read about guidelines for scaling clusters manually in Scaling guidelines.