Create Hadoop clusters in HDInsight

A Hadoop cluster consists of several virtual machines (nodes) that are used for distributed processing of tasks on the cluster. Azure abstracts the implementation details of installation and configuration of individual nodes, so you only have to provide general configuration information. In this article, you learn about these configuration settings.

Access control requirements

You might use an Azure subscription for which you are not the administrator or owner, such as a company-owned subscription. In that case, verify that you have the following before you follow the steps in this article:

  • Contributor access. The account that you use to sign in to Azure needs at least Contributor access to the Azure resource group. This resource group is used to create an HDInsight cluster and other Azure resources.
  • Provider registration. Someone with at least Contributor access to the Azure subscription must have previously registered the provider for the resource you are using. Provider registration happens when a user with Contributor access to the subscription creates a resource for the first time on the subscription. It can also be accomplished without creating a resource by registering a provider through REST.

Cluster types

Currently, Azure HDInsight provides several types of clusters, each with a set of components that provide specific functionality.

Cluster type | Functionality
Hadoop | Query and analysis (batch jobs)
HBase | NoSQL data storage
Storm | Real-time event processing
Spark | In-memory processing, interactive queries, micro-batch stream processing
Interactive Hive (Preview) | In-memory caching for interactive and faster Hive queries
R Server on Spark | A variety of big data statistics, predictive modeling, and machine learning capabilities
Kafka (Preview) | A distributed streaming platform that can be used to build real-time streaming data pipelines and applications

Each cluster type has its own number of nodes within the cluster, terminology for nodes within the cluster, and default VM size for each node type. In the following table, the number of nodes for each node type is in parentheses.

Type | Nodes | Diagram
Hadoop | Head node (2), data node (1+) | HDInsight Hadoop cluster nodes
HBase | Head server (2), region server (1+), master/ZooKeeper node (3) | HDInsight HBase cluster nodes
Storm | Nimbus node (2), supervisor server (1+), ZooKeeper node (3) | HDInsight Storm cluster nodes
Spark | Head node (2), worker node (1+), ZooKeeper node (3) (free for A1 ZooKeeper VM size) | HDInsight Spark cluster nodes

The following tables list the default VM sizes for HDInsight:

  • All supported regions except Brazil South and Japan West:

    Cluster type Hadoop HBase Storm Spark R Server
    Head: default VM size D3 v2 D3 v2 A3 D12 v2 D12 v2
    Head: recommended VM sizes D3 v2, D4 v2, D12 v2 D3 v2, D4 v2, D12 v2 A3, A4, A5 D12 v2, D13 v2, D14 v2 D12 v2, D13 v2, D14 v2
    Worker: default VM size D3 v2 D3 v2 D3 v2 Windows: D12 v2; Linux: D4 v2 Windows: D12 v2; Linux: D4 v2
    Worker: recommended VM sizes D3 v2, D4 v2, D12 v2 D3 v2, D4 v2, D12 v2 D3 v2, D4 v2, D12 v2 Windows: D12 v2, D13 v2, D14 v2; Linux: D4 v2, D12 v2, D13 v2, D14 v2 Windows: D12 v2, D13 v2, D14 v2; Linux: D4 v2, D12 v2, D13 v2, D14 v2
    ZooKeeper: default VM size A3 A2
    ZooKeeper: recommended VM sizes A3, A4, A5 A2, A3, A4
    Edge: default VM size Windows: D12 v2; Linux: D4 v2
    Edge: recommended VM size Windows: D12 v2, D13 v2, D14 v2; Linux: D4 v2, D12 v2, D13 v2, D14 v2
  • Brazil South and Japan West only (no v2 sizes here):

    Cluster type Hadoop HBase Storm Spark R Server
    Head: default VM size D3 D3 A3 D12 D12
    Head: recommended VM sizes D3, D4, D12 D3, D4, D12 A3, A4, A5 D12, D13, D14 D12, D13, D14
    Worker: default VM size D3 D3 D3 Windows: D12; Linux: D4 Windows: D12; Linux: D4
    Worker: recommended VM sizes D3, D4, D12 D3, D4, D12 D3, D4, D12 Windows: D12, D13, D14; Linux: D4, D12, D13, D14 Windows: D12, D13, D14; Linux: D4, D12, D13, D14
    ZooKeeper: default VM size A2 A2
    ZooKeeper: recommended VM sizes A2, A3, A4 A2, A3, A4
    Edge: default VM sizes Windows: D12; Linux: D4
    Edge: recommended VM sizes Windows: D12, D13, D14; Linux: D4, D12, D13, D14
Note

Head is known as Nimbus for the Storm cluster type. Worker is known as Region for the HBase cluster type and as Supervisor for the Storm cluster type.

Important

If you plan on having more than 32 worker nodes, either at cluster creation or by scaling the cluster after creation, then you must select a head node size with at least 8 cores and 14 GB of RAM.
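
The rule above can be written as a small pre-flight check. The following Python sketch is only an illustration: the `VM_SPECS` table copies core and memory figures from the size tables later in this article, and the function name `head_node_ok` is hypothetical and not part of any Azure SDK.

```python
# Cores and memory (GB) for a few sizes, copied from the size tables in this article.
VM_SPECS = {
    "Standard_A3": (4, 7),
    "Standard_A4": (8, 14),
    "Standard_D3_v2": (4, 14),
    "Standard_D4_v2": (8, 28),
    "Standard_D12_v2": (4, 28),
    "Standard_D13_v2": (8, 56),
    "Standard_D14_v2": (16, 112),
}


def head_node_ok(head_size: str, worker_count: int) -> bool:
    """Return True if the head node size satisfies the 32-worker rule."""
    if worker_count <= 32:
        return True  # the rule applies only above 32 worker nodes
    cores, memory_gb = VM_SPECS[head_size]
    return cores >= 8 and memory_gb >= 14


print(head_node_ok("Standard_D3_v2", 40))  # False: D3 v2 has only 4 cores
print(head_node_ok("Standard_D4_v2", 40))  # True: D4 v2 has 8 cores and 28 GB
```

Pass the largest worker count you expect after scaling, not only the count at creation, because the rule covers both cases.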

You can add other components such as Hue or R to these basic types by using script actions.

Important

HDInsight clusters come in a variety of types, which correspond to the workload or technology that the cluster is tuned for. There is no supported method to create a cluster that combines multiple types, such as Storm and HBase on one cluster.

If your solution requires technologies that are spread across multiple HDInsight cluster types, you should create an Azure virtual network and create the required cluster types within the virtual network. This configuration allows the clusters, and any code you deploy to them, to directly communicate with each other.

For more information on using an Azure virtual network with HDInsight, see Extend HDInsight with Azure virtual networks.

For an example of using two cluster types within an Azure virtual network, see Analyze sensor data with Storm and HBase.

Cluster tiers

Azure HDInsight offers its big data cloud services in two tiers: Standard and Premium. HDInsight Premium includes R and other additional components, and it is supported only on HDInsight version 3.5.

The following table lists the HDInsight cluster type and HDInsight Premium support matrix.

Cluster type | Standard | Premium
Hadoop | Yes | Yes
Spark | Yes | Yes
HBase | Yes | No
Storm | Yes | No
R Server on Spark | No | Yes

This table will be updated as more cluster types are included in HDInsight Premium. The following screenshot shows the Azure portal information for choosing cluster types.

HDInsight premium configuration

Basic configuration options

The following are the basic configuration options used to create an HDInsight cluster.

Cluster name

The cluster name is used to identify a cluster. The cluster name must be globally unique, and it must adhere to the following naming guidelines:

  • The field must be a string that contains between 3 and 63 characters.
  • The field can contain only letters, numbers, and hyphens.
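
The two naming guidelines above can be checked locally before you submit a creation request. The following Python sketch assumes that "letters" means ASCII letters, and the function name `is_valid_cluster_name` is hypothetical. It cannot check global uniqueness, and the service might apply further rules that are not listed here.

```python
import re

# 3 to 63 characters; letters, digits, and hyphens only (the two rules listed above).
_CLUSTER_NAME = re.compile(r"[A-Za-z0-9-]{3,63}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name meets the length and character guidelines."""
    return _CLUSTER_NAME.fullmatch(name) is not None


print(is_valid_cluster_name("my-cluster-01"))  # True
print(is_valid_cluster_name("my_cluster"))     # False: underscore is not allowed
print(is_valid_cluster_name("ab"))             # False: shorter than 3 characters
```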

Cluster type

See Cluster types and Cluster tiers.

Operating system

You can create HDInsight clusters on one of the following two operating systems:

  • HDInsight on Linux. HDInsight provides the option of configuring Linux clusters on Azure. Configure a Linux cluster if you're familiar with Linux or Unix, you're migrating from an existing Linux-based Hadoop solution, or you want easy integration with Hadoop ecosystem components built for Linux. For more information, see Get started with Hadoop on Linux in HDInsight.
  • HDInsight on Windows (Windows Server 2012 R2 Datacenter).

HDInsight version

This option is used to determine the version of HDInsight needed for this cluster. For more information, see Hadoop cluster versions and components in HDInsight.

Subscription name

Each HDInsight cluster is tied to one Azure subscription.

Resource group name

Azure Resource Manager helps you work with the resources in your application as a group, referred to as an Azure resource group. You can deploy, update, monitor, or delete all of the resources for your application in a single coordinated operation.

Credentials

With HDInsight clusters, you can configure two user accounts during cluster creation:

  • HTTP user. The default user name is admin when you use the basic configuration on the Azure portal. This account is sometimes called the "cluster user."
  • SSH user (Linux clusters). This is used to connect to the cluster through SSH. For more information, see Use SSH with HDInsight.

    Note

    For Windows-based clusters, you can create an RDP user to connect to the cluster by using RDP.

Data source

The original Hadoop Distributed File System (HDFS) uses many local disks on the cluster. HDInsight uses blobs in Azure Storage. Azure Storage is a robust, general-purpose storage solution that integrates seamlessly with HDInsight. Through an HDFS interface, the full set of components in HDInsight can operate directly on structured or unstructured data stored in blobs. Storing data in Azure Storage helps you safely delete the HDInsight clusters that are used for computation without losing user data.

Warning

HDInsight only supports General purpose Azure Storage accounts. It does not currently support the Blob storage account type.

During configuration, you must specify an Azure Storage account and a blob container on the Azure Storage account. Some creation processes require the Azure Storage account and the blob container to be created beforehand. The blob container is used as the default storage location by the cluster. Optionally, you can specify additional Azure Storage accounts (linked storage) that the cluster can access. The cluster can also access any blob containers that are configured with full public read access or public read access for blobs only. For more information, see Manage access to Azure storage resources.

HDInsight storage

Note

A blob container provides a grouping of a set of blobs as shown in the following image.

Azure blob

We do not recommend that you use the default blob container for storing business data. Deleting the default blob container after each use to reduce storage cost is a good practice. Note that the default container contains application and system logs. Make sure to retrieve the logs before deleting the container.

Warning

Sharing one blob container for multiple clusters is not supported.

For more information on using a secondary Azure Storage account, see Using Azure Storage with HDInsight.

In addition to Azure Storage, you can use Azure Data Lake Store as the default storage account for HBase clusters in HDInsight and as linked storage for all four HDInsight cluster types. For more information, see Create an HDInsight cluster with Data Lake Store using Azure portal.

Location (region)

The HDInsight cluster and its default storage account must be located in the same Azure location.

Azure regions

For a list of supported regions, click the Region drop-down list on HDInsight pricing.

Node pricing tiers

You are billed for the usage of the cluster's nodes for the duration of the cluster’s life. Billing starts when a cluster is created and stops when the cluster is deleted. Clusters can’t be de-allocated or put on hold.

Different cluster types have different node types, numbers of nodes, and node sizes. For example, a Hadoop cluster type has two head nodes and a default of four data nodes, while a Storm cluster type has two Nimbus nodes, three ZooKeeper nodes, and a default of four supervisor nodes. The cost of an HDInsight cluster is determined by the number of nodes and the virtual machine sizes of the nodes. For example, if you know that you will run operations that need a lot of memory, you might want to select a compute resource with more memory. For learning purposes, we recommend that you use one data node. For more information about HDInsight pricing, see HDInsight pricing.

Note

The cluster size limit varies among Azure subscriptions. Contact billing support to increase the limit.

The nodes that your cluster uses do not count as virtual machines because the virtual machine images used for the nodes are an implementation detail of the HDInsight service. The compute cores used by the nodes do count against the total number of compute cores available to your subscription. When you create an HDInsight cluster, you can see the number of available cores and the cores that will be used by the cluster in the summary section of the Node Pricing Tiers blade.
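
To see how a cluster's nodes add up against a core limit, you can total the cores yourself. The following Python sketch uses the default Hadoop layout described above (two head nodes and four data nodes, all D3 v2) and core counts from the size tables in this article. The `available_cores` value and the function name `cores_needed` are hypothetical; read the real number of available cores from the portal.

```python
# CPU cores per VM size, copied from the size tables in this article.
CORES_PER_SIZE = {
    "Standard_A3": 4,
    "Standard_D3_v2": 4,
    "Standard_D4_v2": 8,
    "Standard_D12_v2": 4,
}


def cores_needed(node_groups):
    """Sum the cores for a list of (node_count, vm_size) pairs."""
    return sum(count * CORES_PER_SIZE[size] for count, size in node_groups)


# A Hadoop cluster with two head nodes and four data nodes, all D3 v2.
hadoop_cluster = [(2, "Standard_D3_v2"), (4, "Standard_D3_v2")]

available_cores = 20  # hypothetical remaining quota for the subscription
needed = cores_needed(hadoop_cluster)

print(f"Cluster needs {needed} cores; {available_cores} available.")
if needed > available_cores:
    print("Not enough cores: use fewer or smaller nodes, or request a higher limit.")
```

With these numbers, the cluster needs 6 nodes x 4 cores = 24 cores, which exceeds the assumed limit of 20.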

When you use the Azure portal to configure the cluster, the node size is available through the Node Pricing Tiers blade. You can also see the cost associated with the different node sizes. The following screenshot shows the choices for a Linux-based Hadoop cluster.

HDInsight VM node sizes

The following tables show the sizes supported by HDInsight clusters, and the capacities they provide.

Standard tier: A-series

In the classic deployment model, some VM sizes have different names in PowerShell and the command-line interface (CLI):

  • Standard_A3 is Large
  • Standard_A4 is ExtraLarge

Size | CPU cores | Memory | NICs (Max) | Max. disk size | Max. data disks (1023 GB each) | Max. IOPS (500 per disk)
Standard_A3\Large | 4 | 7 GB | 2 | Temporary = 285 GB | 8 | 8x500
Standard_A4\ExtraLarge | 8 | 14 GB | 4 | Temporary = 605 GB | 16 | 16x500
Standard_A6 | 4 | 28 GB | 2 | Temporary = 285 GB | 8 | 8x500
Standard_A7 | 8 | 56 GB | 4 | Temporary = 605 GB | 16 | 16x500

Standard tier: D-series

Size | CPU cores | Memory | NICs (Max) | Max. disk size | Max. data disks (1023 GB each) | Max. IOPS (500 per disk)
Standard_D3 | 4 | 14 GB | 4 | Temporary (SSD) = 200 GB | 8 | 8x500
Standard_D4 | 8 | 28 GB | 8 | Temporary (SSD) = 400 GB | 16 | 16x500
Standard_D12 | 4 | 28 GB | 4 | Temporary (SSD) = 200 GB | 8 | 8x500
Standard_D13 | 8 | 56 GB | 8 | Temporary (SSD) = 400 GB | 16 | 16x500
Standard_D14 | 16 | 112 GB | 8 | Temporary (SSD) = 800 GB | 32 | 32x500

Standard tier: Dv2-series

Size | CPU cores | Memory | NICs (Max) | Max. disk size | Max. data disks (1023 GB each) | Max. IOPS (500 per disk)
Standard_D3_v2 | 4 | 14 GB | 4 | Temporary (SSD) = 200 GB | 8 | 8x500
Standard_D4_v2 | 8 | 28 GB | 8 | Temporary (SSD) = 400 GB | 16 | 16x500
Standard_D12_v2 | 4 | 28 GB | 4 | Temporary (SSD) = 200 GB | 8 | 8x500
Standard_D13_v2 | 8 | 56 GB | 8 | Temporary (SSD) = 400 GB | 16 | 16x500
Standard_D14_v2 | 16 | 112 GB | 8 | Temporary (SSD) = 800 GB | 32 | 32x500

For deployment considerations to be aware of when you're planning to use these resources, see Sizes for virtual machines. For information about pricing of the various sizes, see HDInsight pricing.

Important

If you plan on having more than 32 worker nodes, either at cluster creation or by scaling the cluster after creation, then you must select a head node size with at least 8 cores and 14 GB of RAM.

Billing starts when a cluster is created, and stops when the cluster is deleted. For more information on pricing, see HDInsight pricing details.

Use additional storage

In some cases, you might want to add more storage to the cluster. For example, you might have multiple Azure storage accounts for different geographical regions or different services, but you want to analyze them all with HDInsight.

You can add storage accounts when you create an HDInsight cluster or after a cluster has been created. See Customize Linux-based HDInsight clusters using Script Action.

For more information about secondary Azure Storage accounts, see Using Azure Storage with HDInsight. For more information about secondary Data Lake Store accounts, see Create HDInsight clusters with Data Lake Store using Azure portal.

Warning

Using an additional storage account in a different location than the HDInsight cluster is not supported.

Use Hive/Oozie metastore

We recommend that you use a custom metastore if you want to retain your Hive tables after you delete your HDInsight cluster. You will be able to attach that metastore to another HDInsight cluster.

Important

An HDInsight metastore that is created for one HDInsight cluster version cannot be shared across different HDInsight cluster versions. For a list of HDInsight versions, see Supported HDInsight versions.

The metastore contains Hive and Oozie metadata, such as Hive tables, partitions, schemas, and columns. The metastore helps you retain your Hive and Oozie metadata, so you don't need to re-create Hive tables or Oozie jobs when you create a new cluster. By default, Hive uses an embedded Azure SQL database to store this information. The embedded database can't preserve the metadata when the cluster is deleted. When you create Hive tables in an HDInsight cluster that has a custom Hive metastore configured, those tables are retained when you re-create the cluster by using the same Hive metastore.

Metastore configuration is not available for HBase cluster types.

Important

When you create a custom metastore, do not use a database name that contains dashes or hyphens. This can cause the cluster creation process to fail.
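
A quick check of the database name before cluster creation can catch this problem early. The following Python sketch treats the hyphen-minus, en dash, and em dash as "dashes or hyphens"; that interpretation, the function name `metastore_name_has_dash`, and the sample database names are assumptions made for illustration.

```python
# Hyphen-minus, en dash, and em dash, written as escapes so they are easy to tell apart.
_DASHES = ("-", "\u2013", "\u2014")


def metastore_name_has_dash(database_name: str) -> bool:
    """Return True if the database name contains a dash or hyphen."""
    return any(dash in database_name for dash in _DASHES)


print(metastore_name_has_dash("hive-metastore"))   # True: rename before creating the cluster
print(metastore_name_has_dash("hivemetastore01"))  # False
```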

Use Azure virtual networks

With Azure Virtual Network, you can create a secure, persistent network that contains the resources that you need for your solution. With a virtual network, you can:

  • Connect cloud resources together in a private network (cloud-only).

    Diagram of cloud-only configuration

  • Connect your cloud resources to your local datacenter network (site-to-site or point-to-site) by using a virtual private network (VPN).

    Site-to-site configuration. With site-to-site configuration, you can connect multiple resources from your datacenter to Azure Virtual Network by using a hardware VPN or the Routing and Remote Access Service.

    Diagram of site-to-site configuration

    Point-to-site configuration. With point-to-site configuration, you can connect a specific resource to the Azure virtual network by using a software VPN.

    Diagram of point-to-site configuration

Windows-based clusters require a virtual network created in the classic deployment model. Linux-based clusters require a virtual network created in the Resource Manager deployment model. If you do not have the correct type of network, it will not be usable when you create the cluster.

For more information about using HDInsight with a virtual network, including specific configuration requirements for the virtual network, see Extend HDInsight capabilities by using Azure Virtual Network.

Customize clusters using HDInsight cluster customization (bootstrap)

Sometimes you need to change settings in the following configuration files:

  • clusterIdentity.xml
  • core-site.xml
  • gateway.xml
  • hbase-env.xml
  • hbase-site.xml
  • hdfs-site.xml
  • hive-env.xml
  • hive-site.xml
  • mapred-site.xml
  • oozie-site.xml
  • oozie-env.xml
  • storm-site.xml
  • tez-site.xml
  • webhcat-site.xml
  • yarn-site.xml

To keep the changes through the lifetime of a cluster, you can use HDInsight cluster customization during the creation process, or you can use Ambari in Linux-based clusters. For more information, see Customize HDInsight clusters using Bootstrap.
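
The *-site.xml files in the list above use the standard Hadoop configuration layout: a `configuration` root element that contains `property` elements, each with a `name` and a `value`. The following Python sketch renders a set of overrides in that layout. It only illustrates the file format and is not how HDInsight cluster customization or Ambari applies the values; the function name `render_site_xml` is hypothetical, and the property and value are example assumptions.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties: dict) -> str:
    """Render name/value overrides as a Hadoop-style site configuration document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Example override for hive-site.xml; the value is chosen only for illustration.
hive_overrides = {"hive.metastore.client.socket.timeout": 90}
print(render_site_xml(hive_overrides))
```

Keeping the overrides as a plain dictionary per file makes it easy to review which settings differ from the defaults before you pass them to whichever creation method you use.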

Note

Windows-based clusters can't retain the changes because the nodes are re-imaged. For more information, see Role Instance Restarts Due to OS Upgrades. To keep the changes through the lifetime of a Windows-based cluster, you must use HDInsight cluster customization during the creation process.

Customize clusters using Script Action

You can install additional components or customize cluster configuration by using scripts during creation. Such scripts are invoked via Script Action, which is a configuration option that can be used from the Azure portal, HDInsight Windows PowerShell cmdlets, or the HDInsight .NET SDK. For more information, see Customize HDInsight cluster using Script Action.

Some native Java components, like Mahout and Cascading, can be run on the cluster as Java Archive (JAR) files. These JAR files can be distributed to Azure Storage and submitted to HDInsight clusters through Hadoop job submission mechanisms. For more information, see Submit Hadoop jobs programmatically.

Note

If you have issues deploying JAR files to HDInsight clusters, or calling JAR files on HDInsight clusters, contact Microsoft Support.

Cascading is not supported by HDInsight and is not eligible for Microsoft Support. For lists of supported components, see What's new in the cluster versions provided by HDInsight?.

Use edge node

An empty edge node is a Linux virtual machine with the same client tools installed and configured as in the head node. You can use the edge node for accessing the cluster, testing your client applications, and hosting your client applications. For more information, see Use empty edge nodes in HDInsight.

Cluster creation methods

In this article, you have learned basic information about creating a Linux-based HDInsight cluster. You can create a cluster by using any of the following methods; choose the one that best suits your needs.

  • The Azure portal
  • Azure Data Factory
  • Azure CLI
  • Azure PowerShell
  • cURL
  • .NET SDK
  • Azure Resource Manager templates