Get Started with Hive on HDInsight

Hi, my name is Dharshana and I work on the Big Data Support Team at Microsoft. As covered in the earlier post by Dan from our team, HDInsight provides a very easy to use interface to provision a Hadoop cluster with a few clicks and interact with the cluster programmatically. In this blog post, we will explore Hive and how it can be used with HDInsight.

Hive is very popular among data analysts/scientists as it provides a friendly SQL-like interface to interact with the Hadoop cluster. However, behind the scenes, Hive uses the MapReduce framework to submit jobs and get results back. When an HDInsight cluster is provisioned, HiveServer and HiveServer2 services are installed on the head node of the cluster. While HDInsight provides the option to interact with both the services, HiveServer2 is the service of choice since it addresses the concurrency limitation imposed by HiveServer.

Hive Implementation on HDInsight

You can find the version of HDInsight deployed with Hive here. The key point to note is that Hive is deployed automatically when provisioning the HDInsight cluster and several interfaces are available to interact with Hive. The backend data for Hive tables need to be uploaded into the Windows Azure BLOB Storage container.

Fig 1. Hive Architecture on HDInsight

Let us break the workflow illustrated in Fig 1. Above into greater detail to understand this.

  1. User has the option to submit a Hive query to the HDInsight cluster using several interfaces as described below:
  2. The user is authenticated by the secure role. If the request originates from SDK/PowerShell, request is passed on to WebHCat Endpoint on the head node. ODBC requests are handled by the HiveServer2 service.
  3. Hive provides schema on read, which is facilitated by the definition for the underlying data - also called the metadata - is stored on a Metastore database. With HDInsight, the Metastore database can be created on SQL Azure using the Hive Metastore configuration details provided when provisioning the cluster. Please note that only the SQL Azure databases residing in the same datacenter as the Azure HDInsight cluster are presented at the provisioning time.
  4. The Hive query is then parsed, compiled and broken down into MapReduce tasks, which is passed over to the JobTracker and TaskTrackers for execution. Data needed for processing is read in from the BLOB storage and the logs from execution are also written into BLOB Storage. Query specific logs are written into the specified default storage container and the service logs are written into the Azure table storage associated with the default storage account.

Provisioning a HDInsight Cluster

You may have noticed that with GA of Azure HDInsight, there are two versions available for provisioning your HDInsight Cluster – 1.6 and 2.1. We recommend that you install the latest version if you are planning on building your cluster for the first time! However, if you have an existing cluster on a version earlier than 2.1, say 1.6, then you have the option to point to the Hive Metastore associated with the 1.6 cluster when provisioning your Version 2.1 cluster.

The Azure HDInsight implementation detaches storage from the compute cluster, so when you tear down your HDInsight cluster, Storage Containers and Metastore databases are still left intact such that they can be reused.

A picture is worth a thousand words! Let us walk through the steps for an initial provisioning of an HDInsight Cluster. I am consciously choosing the older version of HDInsight – Version 1.6 – for this illustration, so we can upgrade this cluster to Version 2.1 on a later section.

  1. Provision Backend Metastore - Create SQL Azure database to hold the Metastore for Hive– HiveMetaStoreDB.

    Fig 2. Hive Metastore Database

  2. Provision Storage Containers – I like to logically separate Hive tables into different storage containers and sometimes it helps to distribute those across subscriptions as well. So, here I am creating three different storage containers for the HDInsight Cluster: One Primary Storage Container (myprimarystorage), Secondary Storage Container (mysecondarystorage) from another subscription, and a third Storage Container from this same subscription to hold Hive Library files (myhivelibs). Screen clip showing the storage account configuration is below.

    Fig 3. Hive Metastore Database

    Did you notice the pattern so far? See how the SQL Azure databases and the storage accounts we created so far are all on the same datacenter and an additional requirement is that the storage account can NOT belong to any affinity group.

    Using one of the cloud storage explorers, such as Azure Storage Explorer, you can pre-provision the storage containers that you need for the HDInsight Cluster. This gives you some control over how you want to name your containers. The screen clip below shows how I have the containers configured –

    Fig 4. Storage Containers

  3. Create the HDInsight Cluster – There are programmatic interfaces like PowerShell, .NET SDK, and Node.js available to script this process for automation. However, to demonstrate this visually, I am leveraging the friendly graphical user interface available from Management Portal. On selecting HDInsight, you can click on the NEW button at the bottom left hand side corner of your Management Portal to create a new HDInsight cluster.


    Fig 5. New HDInsight Cluster

    I am choosing Custom Create Option here and specifying the Cluster Name, Version of 1.6, Number of Data Nodes and the Subscription Name.  Only the storage accounts and SQL Azure databases available on the same region as the HDInsight Cluster, are presented at the time of provisioning the cluster. So, on this specific GUI screen, please make sure you specify the same region as on Steps 1 and 2 above.

    Fig 6. New HDInsight Cluster – Custom Create

    The next screen lets you configure the cluster user and the SQL Azure Metastore Information –

    Fig 7. Configure Cluster User – Custom Create

    At this step, only the SQL Azure Databases available on the same datacenter as the cluster are presented. This step also validates your access into the SQL Azure database that is specified. Note that, when you use programmatic interfaces to provision the cluster, it is a good idea to validate connectivity through SQL Azure Management Portal from your client machine as a pre-validation step.

    The next screen lets you specify the primary storage account and shows the containers that already exist. All the samples and default JAR files are stored on the primary storage container that is specified on this step.  If no containers exist on the selected storage account, you have the option to create a default container, which will have the same name as the storage account. You can also specify the number of additional storage accounts that you want to configure for this cluster here and it will display that many configuration pages. As an example, in the following screen clip, since I specified two additional storage accounts, two additional pages – pages 4 and 5 – are created.

                Fig 8. Configure Storage Account – Custom Create

    The process to configure the other storage accounts is very similar! You would need to choose "Use Storage From Another Subscription" option for Storage Account, to point to a storage account and container from another Subscription. You need to key in the storage account name and key to point to any storage account from another subscription.

            Fig 9. Configure Storage Account from another Subscription– Custom Create

    Once you have configured all the storage accounts, the cluster creation process starts and typically within about 15 minutes, you have a cluster ready to use!

    Fig 10. HDInsight Cluster Creation Complete – Custom Create

Execute a Hive Query on the newly provisioned Cluster!    

Now that we have a new HDInsight Cluster, let us upload some sample data into the primary and secondary storage container, create Hive tables and execute a Hive Query!  You can use any storage explorer tool, like Azure Storage Explorer, to upload data into your configured containers. I am using the sample data from here: I am working with the "" dataset and in specific, the file - "". The data snippet from this file is below:

Fig 11. HDInsight Cluster Creation Complete – Custom Create

Let us remove the header row from the CSV file and upload this file into the Storage Container: "factdata" on the Secondary Storage Account: "mysecondarystorage". If header row is not removed, extra filtering logic needs to be added to Hive queries to ignore the header row. There is currently not an easy way to let Hive ignore the header row when loading data into an external table. There is no concept of directories on Azure BLOB storage, but you can achieve a logical notion of organizing files into directory by naming them with <Directory>/<Filename>. For our file, let us go ahead and name it as "incomegroupbycountry/". All related files can be named with a prefix of "IncomeGroupByCountry" to have a logical binding.

Fig 12. Data file on Storage Container: "factdata"

Discussion on Hive External and Internal tables is out of scope for this discussion, however if you are interested in learning more about it, you can read more about it on this post. I am creating an external table for the purpose of our exercise. I put in the code on an .hql file and uploaded it to the storage container "hivecustomlibs" on the storage account "myhivelibs".

Fig 13. HQL File on storage container: "hivecustomlibs"

The HQL can be executed from PowerShell. Detailed instructions on setting up the PowerShell Environment for HDInsight is available here

$subscriptionName = "Your Subscription Name"
$clusterName = "HDI16"
$queryString = "select * from IncomeGroupByCountry limit 10;" # this query selects 10 rows from the newly created table 

Select-AzureSubscription -SubscriptionName $subscriptionName
Use-AzureHDInsightCluster $clusterName -Subscription (Get-AzureSubscription -Current).SubscriptionId

Invoke-Hive –File "wasb://"
Invoke-Hive $queryString

The HQL file contains the following commands:

drop table if exists IncomeGroupByCountry;
create external table IncomeGroupByCountry(CountryName STRING, CountryCode STRING, Region STRING, IncomeGroup STRING, SpecialNotes STRING)
row format delimited
fields terminated by ','
stored as textfile location 'wasb://';

The above HQL drops the table IncomeGroupByCountry if it already exists, and creates the table and points it to the secondary storage account location. Then an Invoke-Hive cmdlet to select 10 rows from table IncomeGroupByCountry is executed. If everything is working, you should see 10 rows being printed on the console now!

Upgrading your HDInsight Cluster

One nice thing about HDInsight is the ease of upgrade. HDInsight implementation model makes it very easy to tear down the compute cluster and rebuild a new version of the cluster and point it to the existing BLOB containers and SQL Azure metadata. If the new cluster has a higher version of Hive than the older cluster, then the existing SQL Azure Metastore from the older cluster is automatically upgraded at the time of provisioning!

Let us upgrade our HDInsight 1.6 cluster to HDInsight 2.1. Note that on this illustration, we will test the Metastore upgrade by selecting from the IncomeGroupByCountry table that we created on the previous step. When creating the new cluster, any customization that were added like Hive JARs or configuration options, need to be added back. If you interested to see a blog post on this, please let us know! :)

  1. You can delete the HDI16 cluster that we created on the previous step from the management portal. You will notice that the SQL Azure Metastore Databases and the Storage Accounts and Containers remain even after the cluster is deleted. 

  2. Now, let us create a new cluster – we could use the same old name. Note that if the HDInsight Version is left as default, the latest released version of HDInsight will be used by default – so in our case, choosing 2.1 explicitly or Default will both build a version 2.1 cluster. 

    Fig 14. New Version 2.1 Cluster – Custom Create

  3. On the page for configuration of Metastore database, we can point the Metastore database HiveMetaStoreDB from the older cluster to this new cluster.

    Fig 15. Configure Metastore Database – Custom Create

  4. When configuring storage account, we can point to the existing storage accounts that were being used with the older cluster.

    Fig 16. Configure Storage Account – Custom Create

  5. Now that you have the cluster ready, run the following Hive cmdlet to select from IncomeGroupByCountry table and see if you get results!

    $subscriptionName = "Your Subscription Name"
    $clusterName = "HDI16"
    $queryString = "select * from IncomeGroupByCountry limit 10;" # this query selects 10 rows from the newly created table 

    Select-AzureSubscription -SubscriptionName $subscriptionName
    Use-AzureHDInsightCluster $clusterName -Subscription (Get-AzureSubscription -Current).SubscriptionId

    Invoke-Hive –File "wasb://"
    Invoke-Hive $queryString

Hope you enjoyed the post as much as I did writing this! Thanks for reading through this and please let us know if you have any questions on this content or feedback.

@Dharshana (MSFT)