Hadoop tutorial: Get started using Hadoop in HDInsight
Learn how to create Hadoop clusters in HDInsight and how to run Hive jobs on them. Apache Hive is the most popular component in the Hadoop ecosystem. Currently, HDInsight comes with seven different cluster types. Each cluster type supports a different set of components, and all cluster types support Hive. For a list of supported components in HDInsight, see What's new in the Hadoop cluster versions provided by HDInsight?
Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see How to delete an HDInsight cluster.
Before you begin this tutorial, you must have:
- Azure subscription: To create a free one-month trial account, browse to azure.microsoft.com/free.
Most Hadoop jobs are batch jobs: you create a cluster, run some jobs, and then delete the cluster. In this section, you create a Hadoop cluster in HDInsight using an Azure Resource Manager template. Experience with Resource Manager templates is not required to follow this tutorial. For other cluster creation methods and an explanation of the properties used in this tutorial, see Create HDInsight clusters. Use the selector at the top of the page to choose your cluster creation options.
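The template declares the cluster as a Microsoft.HDInsight/clusters resource. The following trimmed sketch shows the general shape of such a resource (property names follow the HDInsight Resource Manager schema, but the API version, VM sizes, and parameter names here are illustrative and may differ from the actual template in GitHub):

```json
{
  "type": "Microsoft.HDInsight/clusters",
  "apiVersion": "2015-03-01-preview",
  "name": "[parameters('clusterName')]",
  "location": "[resourceGroup().location]",
  "properties": {
    "clusterVersion": "3.5",
    "osType": "Linux",
    "clusterDefinition": {
      "kind": "hadoop",
      "configurations": {
        "gateway": {
          "restAuthCredential.isEnabled": true,
          "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
          "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
        }
      }
    },
    "computeProfile": {
      "roles": [
        {
          "name": "headnode",
          "targetInstanceCount": 2,
          "hardwareProfile": { "vmSize": "Standard_D3" }
        },
        {
          "name": "workernode",
          "targetInstanceCount": 2,
          "hardwareProfile": { "vmSize": "Standard_D3" }
        }
      ]
    }
  }
}
```

The storageProfile section, which links the cluster to its default storage account, is omitted here for brevity.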
The Resource Manager template used in this tutorial is located in GitHub.
Click the following image to sign in to Azure and open the Resource Manager template in the Azure portal.
Enter or select the following values:
- Subscription: Select your Azure subscription.
- Resource group: Create a resource group or select an existing resource group. A resource group is a container of Azure components. In this case, the resource group contains the HDInsight cluster and the dependent Azure Storage account.
- Location: Select an Azure location where you want to create your cluster. Choose a location closer to you for better performance.
- Cluster Type: Select hadoop for this tutorial.
- Cluster Name: Enter a name for the Hadoop cluster.
- Cluster login name and password: The default login name is admin.
- SSH username and password: The default username is sshuser. You can rename it.
Some properties are hardcoded in the template. If you want to change these values, edit the template.
- Location: The cluster and the dependent storage account use the same location as the resource group.
- Cluster version: 3.5
- OS Type: Linux
- Number of worker nodes: 2
Each cluster depends on an Azure Storage account or an Azure Data Lake account, referred to as the default storage account. The HDInsight cluster and its default storage account must be co-located in the same Azure region. Deleting the cluster does not delete the storage account.
For more explanation of these properties, see Create Hadoop clusters in HDInsight.
Select I agree to the terms and conditions stated above and Pin to dashboard, and then click Purchase. You see a new tile titled Deploying Template deployment on the portal dashboard. It takes about 20 minutes to create a cluster. Once the cluster is created, the tile caption changes to the resource group name you specified, and the portal automatically opens the resource group in a new blade. You can see both the cluster and the default storage account listed.
Click the cluster name to open the cluster in a new blade.
Run Hive queries
Apache Hive is the most popular component used in HDInsight. There are many ways to run Hive jobs in HDInsight. In this tutorial, you use the Ambari Hive view from the portal. For other methods for submitting Hive jobs, see Use Hive in HDInsight.
- From the cluster blade, click Cluster Dashboard, and then click HDInsight Cluster Dashboard. You can also open Ambari by browsing to https://<ClusterName>.azurehdinsight.net, where <ClusterName> is the name of the cluster you created in the previous section.
- Enter the Hadoop username and password that you specified in the previous section. The default username is admin.
Open Hive View as shown in the following screenshot:
In the Query Editor section of the page, paste the following HiveQL statements into the worksheet:
The semicolon at the end of each statement is required by Hive.
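The HiveQL statements themselves are missing from this copy. Based on the result described below (a listing that contains one table, hivesampletable), the statement to paste is presumably a table listing:

```sql
SHOW TABLES;
```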
Click Execute. A Query Process Results section should appear beneath the Query Editor and display information about the job.
Once the query has finished, the Query Process Results section displays the results of the operation. You see one table called hivesampletable. This sample Hive table comes with all HDInsight clusters.
Repeat the previous two steps to run the following query:
SELECT * FROM hivesampletable;
Note the Save results dropdown in the upper left of the Query Process Results section; you can use this to either download the results, or save them to HDInsight storage as a CSV file.
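You can also filter and aggregate the sample table from the same worksheet. For example, the following query counts records per device manufacturer (this assumes devicemake, a column in hivesampletable's schema):

```sql
SELECT devicemake, COUNT(*) AS device_count
FROM hivesampletable
GROUP BY devicemake
LIMIT 10;
```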
- Click History to get a list of the jobs.
After you have completed a Hive job, you can export the results to an Azure SQL database or SQL Server database, or visualize the results using Excel. For more information about using Hive in HDInsight, see Use Hive and HiveQL with Hadoop in HDInsight to analyze a sample Apache log4j file.
Clean up the tutorial
After you complete the tutorial, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even when it is not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use.
Using Azure Data Factory, you can create HDInsight clusters on demand, and configure a TimeToLive setting to delete the clusters automatically.
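In Data Factory, an on-demand cluster is defined as a linked service with a timeToLive setting that controls how long an idle cluster is kept before it is deleted. A minimal sketch, using the Data Factory v1 JSON shape (the linked service names here are illustrative):

```json
{
  "name": "HDInsightOnDemandLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 4,
      "timeToLive": "00:30:00",
      "osType": "Linux",
      "linkedServiceName": "AzureStorageLinkedService"
    }
  }
}
```

With this definition, Data Factory creates a cluster when an activity needs one and deletes it 30 minutes after the last activity completes.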
To delete the cluster and/or the default storage account
- Sign in to the Azure portal.
- From the portal dashboard, click the tile with the resource group name you used when you created the cluster.
- Click Delete on the resource group blade to delete the resource group, which contains the cluster and the default storage account. Alternatively, click the cluster name on the Resources tile, and then click Delete on the cluster blade. Note that deleting the resource group also deletes the storage account. If you want to keep the storage account, delete only the cluster.
If you run into issues with creating HDInsight clusters, see access control requirements.
In this tutorial, you have learned how to create a Linux-based HDInsight cluster using a Resource Manager template, and how to perform basic Hive queries.
To learn more about analyzing data with HDInsight, see the following articles:
- To learn more about using Hive with HDInsight, including how to perform Hive queries from Visual Studio, see Use Hive with HDInsight.
- To learn about Pig, a language used to transform data, see Use Pig with HDInsight.
- To learn about MapReduce, a way to write programs that process data on Hadoop, see Use MapReduce with HDInsight.
- To learn about using the HDInsight Tools for Visual Studio to analyze data on HDInsight, see Get started using Visual Studio Hadoop tools for HDInsight.
If you're ready to start working with your own data and need to know more about how HDInsight stores data or how to get data into HDInsight, see the following:
- For information on how HDInsight uses Azure Storage, see Use Azure Storage with HDInsight.
- For information on how to upload data to HDInsight, see Upload data to HDInsight.
If you'd like to learn more about creating or managing an HDInsight cluster, see the following:
- To learn about managing your Linux-based HDInsight cluster, see Manage HDInsight clusters using Ambari.
- To learn more about the options you can select when creating an HDInsight cluster, see Creating HDInsight on Linux using custom options.
If you are familiar with Linux and Hadoop, but want to know specifics about Hadoop on HDInsight, see Working with HDInsight on Linux. This article provides information such as:
- URLs for services hosted on the cluster, such as Ambari and WebHCat
- The location of Hadoop files and examples on the local file system
- The use of Azure Storage (WASB) instead of HDFS as the default data store