Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more

Learn how to set up and configure Apache Hadoop, Apache Spark, Apache Kafka, Interactive Query, Apache HBase, ML Services, or Apache Storm in HDInsight. Also, learn how to customize clusters and add security by joining them to a domain.

A Hadoop cluster consists of several virtual machines (nodes) that are used for distributed processing of tasks. Azure HDInsight handles the implementation details of installing and configuring individual nodes, so you only have to provide general configuration information.

Important

HDInsight cluster billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute, so you should always delete your cluster when it is no longer in use. Learn how to delete a cluster.

Cluster setup methods

The following table shows the different methods you can use to set up an HDInsight cluster.

Clusters created with             Web browser  Command line  REST API  SDK
Azure portal
Azure Data Factory
Azure CLI
Azure PowerShell
cURL
Azure Resource Manager templates

This article walks you through setup in the Azure portal, where you can create an HDInsight cluster using the default view or Classic.

Basics

Project details

Azure Resource Manager helps you work with the resources in your application as a group, referred to as an Azure resource group. You can deploy, update, monitor, or delete all the resources for your application in a single coordinated operation.

Cluster details

Cluster name

HDInsight cluster names have the following restrictions:

  • Allowed characters: a-z, 0-9, A-Z
  • Max length: 59
  • Reserved names: apps
  • The cluster naming scope is all of Azure, across all subscriptions, so the cluster name must be unique worldwide.
  • The first six characters must be unique within a virtual network.
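
The character, length, and reserved-name rules above can be checked locally before a cluster is requested. The following is a minimal sketch in Python, not part of any Azure SDK: the function name `cluster_name_problems` is hypothetical, and comparing the reserved name without regard to case is an assumption. Worldwide uniqueness and the first-six-characters rule depend on existing clusters, so a local check cannot cover them.

```python
import re

# Reserved names from the list above. Case-insensitive matching is an assumption.
RESERVED_CLUSTER_NAMES = {"apps"}
MAX_CLUSTER_NAME_LENGTH = 59


def cluster_name_problems(name: str) -> list:
    """Return the rule violations found in a proposed cluster name (empty list if none)."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > MAX_CLUSTER_NAME_LENGTH:
        problems.append("longer than 59 characters")
    # Only a-z, A-Z, and 0-9 are listed as allowed characters.
    if re.search(r"[^A-Za-z0-9]", name):
        problems.append("contains characters other than a-z, A-Z, 0-9")
    if name.lower() in RESERVED_CLUSTER_NAMES:
        problems.append("uses a reserved name")
    return problems


print(cluster_name_problems("mycluster01"))  # []
print(cluster_name_problems("apps"))         # ['uses a reserved name']
```

Returning a list of problems rather than a single boolean lets a caller report every violation at once.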

Region

You don't need to specify the cluster location explicitly: the cluster is in the same location as the default storage. For a list of supported regions, select the Region drop-down list on the HDInsight pricing page.

Cluster type

Azure HDInsight currently provides the following cluster types, each with a set of components to provide certain functionalities.

Important

HDInsight clusters are available in various types, each for a single workload or technology. There is no supported method to create a cluster that combines multiple types, such as Storm and HBase on one cluster. If your solution requires technologies that are spread across multiple HDInsight cluster types, an Azure virtual network can connect the required cluster types.

Cluster type       Functionality
Hadoop             Batch query and analysis of stored data
HBase              Processing for large amounts of schemaless, NoSQL data
Interactive Query  In-memory caching for interactive and faster Hive queries
Kafka              A distributed streaming platform that can be used to build real-time streaming data pipelines and applications
ML Services        Various big data statistics, predictive modeling, and machine learning capabilities
Spark              In-memory processing, interactive queries, micro-batch stream processing
Storm              Real-time event processing

Version

Choose the version of HDInsight for this cluster. For more information, see Supported HDInsight versions.

Cluster credentials

With HDInsight clusters, you can configure two user accounts during cluster creation:

  • Cluster login username: The default username is admin. It uses the basic configuration on the Azure portal. It is sometimes called the "Cluster user" or the "HTTP user."
  • Secure Shell (SSH) username: Used to connect to the cluster through SSH. For more information, see Use SSH with HDInsight.

The HTTP username has the following restrictions:

  • Allowed special characters: _ and @
  • Characters not allowed: #;."',/:`!*?$(){}[]<>|&--=+%~^ and the space character
  • Max length: 20

The SSH username has the following restrictions:

  • Allowed special characters: _ and @
  • Characters not allowed: #;."',/:`!*?$(){}[]<>|&--=+%~^ and the space character
  • Max length: 64
  • Reserved names: hadoop, users, oozie, hive, mapred, ambari-qa, zookeeper, tez, hdfs, sqoop, yarn, hcat, ams, hbase, storm, administrator, admin, user, user1, test, user2, test1, user3, admin1, 1, 123, a, actuser, adm, admin2, aspnet, backup, console, david, guest, john, owner, root, server, sql, support, support_388945a0, sys, test2, test3, user4, user5, spark
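
The two sets of username restrictions differ only in maximum length and in the reserved-name list, so one checker can serve both. The following Python sketch is illustrative only: `username_problems` and the `kind` argument are hypothetical names, and matching reserved names exactly (case-sensitively) is an assumption.

```python
# Characters listed above as not allowed in either username, plus the space character.
DISALLOWED_CHARS = set("#;.\"',/:`!*?$(){}[]<>|&-=+%~^ ")

MAX_LENGTH = {"http": 20, "ssh": 64}

# Reserved SSH names, copied from the list above.
SSH_RESERVED_NAMES = set(
    "hadoop users oozie hive mapred ambari-qa zookeeper tez hdfs sqoop yarn hcat "
    "ams hbase storm administrator admin user user1 test user2 test1 user3 admin1 "
    "1 123 a actuser adm admin2 aspnet backup console david guest john owner root "
    "server sql support support_388945a0 sys test2 test3 user4 user5 spark".split()
)


def username_problems(name: str, kind: str) -> list:
    """Return rule violations for a username; kind is "http" or "ssh"."""
    problems = []
    if len(name) > MAX_LENGTH[kind]:
        problems.append(f"longer than {MAX_LENGTH[kind]} characters")
    bad = sorted(set(name) & DISALLOWED_CHARS)
    if bad:
        problems.append("contains disallowed characters: " + "".join(bad))
    # Only the SSH username has a reserved-name list in this article.
    if kind == "ssh" and name in SSH_RESERVED_NAMES:
        problems.append("is a reserved SSH name")
    return problems


print(username_problems("admin", "http"))  # []
print(username_problems("admin", "ssh"))   # ['is a reserved SSH name']
```

Note that admin is accepted as an HTTP username (it is the default) but rejected as an SSH username.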

Storage

Although an on-premises installation of Hadoop uses the Hadoop Distributed File System (HDFS) for storage on the cluster, in the cloud you use storage endpoints connected to the cluster. Using cloud storage means you can safely delete the HDInsight clusters used for computation while still retaining your data.

HDInsight clusters can use the following storage options:

  • Azure Data Lake Storage Gen2
  • Azure Data Lake Storage Gen1
  • Azure Storage General Purpose v2
  • Azure Storage General Purpose v1
  • Azure Storage Block blob (only supported as secondary storage)

For more information on storage options with HDInsight, see Compare storage options for use with Azure HDInsight clusters.

Warning

Using an additional storage account in a different location from the HDInsight cluster is not supported.

During configuration, for the default storage endpoint you specify a blob container of an Azure Storage account or Data Lake Storage. The default storage contains application and system logs. Optionally, you can specify additional linked Azure Storage accounts and Data Lake Storage accounts that the cluster can access. The HDInsight cluster and the dependent storage accounts must be in the same Azure location.
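
The same-location requirement can be checked before deployment. The following Python sketch assumes you already know each account's location; the function names and the account names in the example are hypothetical. Treating "East US" and "eastus" as the same location is a normalization assumption, not documented behavior.

```python
def _normalize(location: str) -> str:
    """Compare locations loosely: drop spaces and ignore case (an assumption)."""
    return location.replace(" ", "").lower()


def accounts_outside_cluster_location(cluster_location: str, storage_locations: dict) -> list:
    """Return the names of storage accounts whose location differs from the cluster's."""
    target = _normalize(cluster_location)
    return sorted(
        name for name, loc in storage_locations.items() if _normalize(loc) != target
    )


# Hypothetical account names and locations, for illustration only.
linked = {"defaultstore": "eastus", "archive": "West Europe"}
print(accounts_outside_cluster_location("East US", linked))  # ['archive']
```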

Note

The secure transfer required feature forces all requests to your storage account to use a secure connection. Only HDInsight cluster version 3.6 or newer supports this feature. For more information, see Create Apache Hadoop cluster with secure transfer storage accounts in Azure HDInsight.

Metastore settings

You can create optional Hive or Apache Oozie metastores. However, not all cluster types support metastores, and Azure SQL Data Warehouse isn't compatible with metastores.

For more information, see Use external metadata stores in Azure HDInsight.

Important

When you create a custom metastore, don't use dashes, hyphens, or spaces in the database name. Doing so can cause the cluster creation process to fail.
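
A quick check of a proposed metastore database name can catch this before deployment. The sketch below is a hypothetical helper, not an Azure API. Rejecting Unicode dash variants and all whitespace, not only the ASCII hyphen and space, is a cautious assumption.

```python
# ASCII hyphen plus common Unicode dash variants (rejecting the variants is an assumption).
DASH_LIKE = "-\u2010\u2011\u2012\u2013\u2014"


def is_safe_metastore_db_name(name: str) -> bool:
    """True if the name is non-empty and has no dash-like or whitespace characters."""
    return bool(name) and not any(ch in DASH_LIKE or ch.isspace() for ch in name)


print(is_safe_metastore_db_name("hivemetastore01"))  # True
print(is_safe_metastore_db_name("hive-metastore"))   # False
```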

SQL database for Hive

If you want to retain your Hive tables after you delete an HDInsight cluster, use a custom metastore. You can then attach the metastore to another HDInsight cluster.

An HDInsight metastore that is created for one HDInsight cluster version can't be shared across different HDInsight cluster versions. For a list of HDInsight versions, see Supported HDInsight versions.

SQL database for Oozie

To increase performance when using Oozie, use a custom metastore. A metastore can also provide access to Oozie job data after you delete your cluster.

SQL database for Ambari

Ambari is used to monitor HDInsight clusters, make configuration changes, and store cluster management information and job history. The custom Ambari DB feature allows you to deploy a new cluster and set up Ambari in an external database that you manage. For more information, see Custom Ambari DB.

Important

You cannot reuse a custom Oozie metastore. To use a custom Oozie metastore, you must provide an empty Azure SQL Database when creating the HDInsight cluster.

Security + networking

Enterprise security package

For Hadoop, Spark, HBase, Kafka, and Interactive Query cluster types, you can choose to enable the Enterprise Security Package. This package provides the option of a more secure cluster setup by using Apache Ranger and integrating with Azure Active Directory. For more information, see Overview of enterprise security in Azure HDInsight.

The Enterprise Security Package allows you to integrate HDInsight with Active Directory and Apache Ranger. You can also use it to create multiple users.

For more information on creating a domain-joined HDInsight cluster, see Create domain-joined HDInsight sandbox environment.

TLS

For more information, see Transport Layer Security.

Virtual network

If your solution requires technologies that are spread across multiple HDInsight cluster types, an Azure virtual network can connect the required cluster types. This configuration allows the clusters, and any code you deploy to them, to directly communicate with each other.

For more information on using an Azure virtual network with HDInsight, including specific configuration requirements for the virtual network, see Plan a virtual network for HDInsight.

For an example of using two cluster types within an Azure virtual network, see Use Apache Spark Structured Streaming with Apache Kafka.

Disk encryption setting

For more information, see Customer-managed key disk encryption.

Kafka REST proxy

This setting is only available for the Kafka cluster type. For more information, see Using a REST proxy.

Identity

For more information, see Managed identities in Azure HDInsight.

Configuration + pricing

You're billed for node usage for as long as the cluster exists. Billing starts when a cluster is created and stops when the cluster is deleted. Clusters can't be de-allocated or put on hold.

Node configuration

Each cluster type has its own number of nodes, terminology for nodes, and default VM size. In the following table, the number of nodes for each node type is in parentheses.

Type    Nodes
Hadoop  Head node (2), Worker node (1+)
HBase   Head server (2), region server (1+), master/ZooKeeper node (3)
Storm   Nimbus node (2), supervisor server (1+), ZooKeeper node (3)
Spark   Head node (2), Worker node (1+), ZooKeeper node (3) (free for A1 ZooKeeper VM size)

For more information, see Default node configuration and virtual machine sizes for clusters in "What are the Hadoop components and versions in HDInsight?"

The cost of HDInsight clusters is determined by the number of nodes and the virtual machine sizes for the nodes.
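
Because billing is pro-rated per minute and depends on node count and VM size, a rough estimate is a sum over node groups. The Python sketch below illustrates that arithmetic only: `NodeGroup` and `estimate_cost` are hypothetical names, and the hourly rates are made-up placeholders, not Azure prices. Use the HDInsight pricing page for real figures.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    role: str           # for example "head", "worker", or "zookeeper"
    count: int          # number of nodes in this group
    hourly_rate: float  # placeholder price per node-hour, NOT a real Azure rate


def estimate_cost(groups, minutes_alive: int) -> float:
    """Estimate the cost of a cluster that exists for minutes_alive minutes."""
    hourly_total = sum(g.count * g.hourly_rate for g in groups)
    # Pro-rate per minute: the cluster is billed from creation until deletion.
    return hourly_total * minutes_alive / 60


# A Hadoop cluster with two head nodes and four worker nodes, kept for 90 minutes.
hadoop = [NodeGroup("head", 2, 0.50), NodeGroup("worker", 4, 0.50)]
print(estimate_cost(hadoop, 90))  # 4.5
```

The estimate grows linearly with the cluster's lifetime, which is why deleting an idle cluster matters.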

Different cluster types have different node types, numbers of nodes, and node sizes:

  • Hadoop cluster type default:
    • Two head nodes
    • Four Worker nodes
  • Storm cluster type default:
    • Two Nimbus nodes
    • Three ZooKeeper nodes
    • Four supervisor nodes

If you're just trying out HDInsight, we recommend you use one Worker node. For more information about HDInsight pricing, see HDInsight pricing.

Note

The cluster size limit varies among Azure subscriptions. Contact Azure billing support to increase the limit.

When you use the Azure portal to configure the cluster, the node size is available through the Configuration + pricing tab. In the portal, you can also see the cost associated with the different node sizes.

Virtual machine sizes

When you deploy clusters, choose compute resources based on the solution you plan to deploy. The following VMs are used for HDInsight clusters:

To find out what value to use to specify a VM size when you create a cluster with the different SDKs or with Azure PowerShell, see VM sizes to use for HDInsight clusters. In that article, use the value in the Size column of the tables.

Important

If you need more than 32 Worker nodes in a cluster, you must select a head node size with at least 8 cores and 14 GB of RAM.
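
The head node requirement is a simple threshold rule, sketched below in Python. The function name and parameters are hypothetical. The sketch encodes only the rule stated above and makes no claim about other sizing constraints.

```python
def head_node_is_sufficient(worker_count: int, head_cores: int, head_ram_gb: float) -> bool:
    """Apply the rule above: more than 32 workers needs a head node with >= 8 cores and >= 14 GB RAM."""
    if worker_count <= 32:
        # The article states no extra head node requirement at or below 32 workers.
        return True
    return head_cores >= 8 and head_ram_gb >= 14


print(head_node_is_sufficient(worker_count=40, head_cores=4, head_ram_gb=28))  # False
print(head_node_is_sufficient(worker_count=40, head_cores=8, head_ram_gb=14))  # True
```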

For more information, see Sizes for virtual machines. For information about pricing of the various sizes, see HDInsight pricing.

Add application

An HDInsight application is an application that users can install on a Linux-based HDInsight cluster. You can use applications provided by Microsoft or third parties, or ones that you develop yourself. For more information, see Install third-party Apache Hadoop applications on Azure HDInsight.

Most HDInsight applications are installed on an empty edge node. An empty edge node is a Linux virtual machine with the same client tools installed and configured as in the head node. You can use the edge node for accessing the cluster, testing your client applications, and hosting your client applications. For more information, see Use empty edge nodes in HDInsight.

Script actions

You can install additional components or customize cluster configuration by using scripts during creation. Such scripts are invoked via Script Action, which is a configuration option that can be used from the Azure portal, HDInsight Windows PowerShell cmdlets, or the HDInsight .NET SDK. For more information, see Customize HDInsight cluster using Script Action.

Some native Java components, like Apache Mahout and Cascading, can be run on the cluster as Java Archive (JAR) files. These JAR files can be distributed to Azure Storage and submitted to HDInsight clusters with Hadoop job submission mechanisms. For more information, see Submit Apache Hadoop jobs programmatically.

Note

If you have issues deploying JAR files to HDInsight clusters, or calling JAR files on HDInsight clusters, contact Microsoft Support.

Cascading is not supported by HDInsight and is not eligible for Microsoft Support. For lists of supported components, see What's new in the cluster versions provided by HDInsight.

Sometimes you need to customize the following configuration files during the creation process:

  • clusterIdentity.xml
  • core-site.xml
  • gateway.xml
  • hbase-env.xml
  • hbase-site.xml
  • hdfs-site.xml
  • hive-env.xml
  • hive-site.xml
  • mapred-site
  • oozie-site.xml
  • oozie-env.xml
  • storm-site.xml
  • tez-site.xml
  • webhcat-site.xml
  • yarn-site.xml

For more information, see Customize HDInsight clusters using Bootstrap.
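
One way to organize such customizations before handing them to a deployment tool is a mapping from configuration file to property overrides, checked against the list above. The Python sketch below is an illustration only: the dictionary layout and the function `unknown_config_files` are assumptions rather than an HDInsight API, and the property values are arbitrary examples.

```python
# File names copied from the list above ("mapred-site" appears there without ".xml").
CONFIGURABLE_FILES = [
    "clusterIdentity.xml", "core-site.xml", "gateway.xml", "hbase-env.xml",
    "hbase-site.xml", "hdfs-site.xml", "hive-env.xml", "hive-site.xml",
    "mapred-site", "oozie-site.xml", "oozie-env.xml", "storm-site.xml",
    "tez-site.xml", "webhcat-site.xml", "yarn-site.xml",
]


def _stem(name: str) -> str:
    """Drop a trailing ".xml" so "core-site" and "core-site.xml" compare equal."""
    return name[:-4] if name.endswith(".xml") else name


def unknown_config_files(overrides: dict) -> list:
    """Return the override keys that are not in the list of configurable files."""
    known = {_stem(f) for f in CONFIGURABLE_FILES}
    return sorted(name for name in overrides if _stem(name) not in known)


# Example overrides; the values are arbitrary and for illustration only.
overrides = {
    "core-site.xml": {"fs.trash.interval": "60"},
    "hive-site": {"hive.metastore.client.socket.timeout": "90"},
    "spark-defaults.conf": {"spark.executor.memory": "4g"},
}
print(unknown_config_files(overrides))  # ['spark-defaults.conf']
```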

Next steps