Connect to Azure HDInsight and run Hive queries using Data Lake Tools for Visual Studio

Learn how to use Data Lake Tools for Visual Studio (it is also called Azure Data Lake and Stream Analytics Tools) to connect to Hadoop clusters in Azure HDInsight and submit Hive queries. For more information about using HDInsight, see Introduction to HDInsight and Get started with HDInsight. For more information about connecting to a Storm cluster, see Develop C# topologies for Apache Storm on HDInsight using Visual Studio.

Data Lake Tools for Visual Studio can be used to access both Data Lake Analytics and HDInsight. For the information about Data Lake Tools, see Tutorial: develop U-SQL scripts using Data Lake Tools for Visual Studio.

Prerequisites

To complete this tutorial and use the Data Lake Tools in Visual Studio, you need the following items:

Install and upgrade Data Lake Tools for Visual Studio

Data Lake Tools is installed by default for Visual Studio 2017. For older Visual Studio versions, you can install it using the Web Platform Installer. You must choose the one that matches your version of Visual Studio. If you don't have Visual Studio installed, you can install the latest Visual Studio Community and Azure SDK using the Web Platform Installer:

Data Lake Tools for Visual Studio Web Platform installer.

To update the tools

  1. Open Visual Studio.
  2. From the Tools menu, click Extensions and updates.
  3. Expand Updates, update Azure Data Lake and Stream Analytics Tools if there is any.

Note

Only the version 2.3.0.0 or later supports connecting to Interactive Query clusters and running interactive Hive queries.

Connect to Azure subscriptions

Data Lake Tools for Visual Studio allows you to connect to your HDInsight clusters, perform some basic management operations, and run Hive queries.

Note

For information on connecting to a generic Hadoop cluster, see Write and submit Hive queries using Visual Studio.

To connect to your Azure subscription

  1. Open Visual Studio.
  2. From the View menu, click Server Explorer to open the Server Explorer window.
  3. Expand Azure, and then expand HDInsight.

    Note

    Notice the HDInsight Task List window should be open. If you don't see it, click Other Windows from the View menu, and then click HDInsight Task List Window.

  4. Enter your Azure subscription credentials, and then click Sign In. Authentication is only required if you have never connected to the Azure subscription from Visual Studio on this workstation.
  5. In Server Explorer, you see a list of existing HDInsight clusters. If you don't have any clusters, you can create one by using the Azure portal, Azure PowerShell, or the HDInsight SDK. For more information, see Create HDInsight clusters.

    Data Lake Tools for Visual Studio Server Explorer cluster list

  6. Expand an HDInsight cluster. You see Hive Databases, a default storage account, linked storage accounts, and Hadoop Service log. You can further expand the entities.

After you've connected to your Azure subscription, you'll be able to do the following tasks:

To connect to the Azure portal from Visual Studio

  • From Server Explorer, expand Azure > HDInsight, right-click an HDInsight cluster, and then click Manage Cluster in Azure portal.

To ask questions and provide feedback from Visual Studio

  • From the Tools menu, click HDInsight, and then click MSDN Forum to ask questions, or click Give Feedback.

From Server Explorer, you can see the default storage account and any linked storage accounts. If you expand the default storage account, you can see the containers on the storage account. The default storage account and the default container are marked. You can also right-click any of the containers to view the contents.

Data Lake Tools for Visual Studio server explorer list linked resources

After opening a container, you can use the following buttons to upload, delete, and download blobs:

Data Lake Tools for Visual Studio server explorer blob operations

Run interactive Hive queries

Apache Hive is a data warehouse infrastructure built on Hadoop for providing data summarization, queries, and analysis. Data Lake Tools for Visual Studio supports running Hive queries from Visual Studio. For more information about Hive, see Use Hive with HDInsight.

Interactive Query leverages Hive on LLAP in Apache Hive 2.1 brings the interactivity to your complex data warehouse style queries on large datasets stored. Running Hive queries on Interactive Query is much faster comparing to the traditional Hive batch jobs. For more information about running Hive batch jobs, see Run Hive batch jobs.

Note

Running interactive Hive queries is only supported when you connect to a HDInsight Interactive Query cluster.

Data Lake Tools for Visual Studio also enables users to see what’s inside the Hive job by collecting and surfacing the YARN logs of certain Hive jobs.

View the hivesampletable

All HDInsight clusters come with a sample Hive table called hivesampletable. This Hive table is used to show how to list Hive tables, view the table schemas, and list the rows in the Hive table.

To list Hive tables and view Hive table schema

  1. From Server Explorer, expand Azure > HDInsight > the cluster of your choice > Hive Databases > Default > hivesampletable to see the table schema.
  2. Right-click hivesampletable, and then click View Top 100 Rows to list the rows. It is equivalent to running the following Hive query using Hive ODBC driver:

    SELECT * FROM hivesampletable LIMIT 100

    You can customize the row count.

    Data Lake Tools: HDInsight Hive Visual Studio schema query

Create Hive tables

You can use the GUI to create a Hive table or use Hive queries. For information about using Hive queries, see Run Hive queries.

To create a Hive table

  1. From Server Explorer, expand Azure > HDInsight Clusters an HDInsight cluster > Hive Databases, then right-click default, and click Create Table.
  2. Configure the table.
  3. Click Create Table to submit the job to create the new Hive table.

    Data Lake Tools: HDInsight Visual Studio Tools create hive table

Validate and run Hive queries

There are two ways to create and run Hive queries:

  • Create ad hoc queries
  • Create a Hive application

To create, validate, and run ad hoc queries

  1. From Server Explorer, expand Azure, and then expand HDInsight Clusters.
  2. Right-click the cluster where you want to run the query, and then click Write a Hive Query.
  3. Enter the Hive queries. Notice the Hive editor supports IntelliSense. Data Lake Tools for Visual Studio supports loading the remote metadata when you are editing your Hive script. For example, when you type "SELECT * FROM", the IntelliSense lists all the suggested table names. When a table name is specified, the column names are listed by the IntelliSense. The tools support almost all Hive DML statements, subqueries, and the built-in UDFs.

    Data Lake Tools: HDInsight Visual Studio Tools IntelliSense

    Data Lake Tools: HDInsight Visual Studio Tools IntelliSense

    Note

    Only the metadata of the cluster that is selected in HDInsight Toolbar is suggested.

  4. (Optional) Click Validate Script to check the script syntax errors.

    Data Lake Tools: Data Lake Tools for Visual Studio local validation

  5. Click Submit or Submit (Advanced). With the advanced submit option, you configure Job Name, Arguments, Additional Configurations, and Status Directory for the script:

    HDInsight Hadoop hive query

    After you submit the job, you see a Hive Job Summary window.

    Summary of an HDInsight Hadoop Hive query

  6. Use the Refresh button to update the status until the job status changes to Completed.
  7. Click the links at the bottom to see: Job Query, Job Output, Job log, or Yarn log.

To create and run a Hive solution

  1. From the FILE menu, click New, and then click Project.
  2. Select HDInsight from the left pane, select Hive Application in the middle pane, enter the properties, and then click OK.

    Data Lake Tools: HDInsight Visual Studio Tools new hive project

  3. From Solution Explorer, double-click Script.hql to open it.
  4. To validate the Hive script, you can click the Validate Script button, or right-click the script in the Hive editor, and then click Validate Script from the context menu.

View Hive jobs

You can view job queries, job output, job logs, and Yarn logs for Hive jobs. For more information, see the previous screenshot.

The most recent release of the tools allows you to see what’s inside your Hive jobs by collecting and surfacing YARN logs. A YARN log can help you investigating performance issues. For more information about how HDInsight collects YARN logs, see Access HDInsight Application Logs Programmatically.

To view Hive jobs

  1. From Server Explorer, expand Azure, and then expand HDInsight.
  2. Right-click an HDInsight cluster, and then click View Jobs. You see a list of the Hive jobs that ran on the cluster.
  3. Click a job in the job list to select it, and then use the Hive Job Summary window to open Job Query, Job Output, Job Log, or Yarn log.

    Data Lake Tools: HDInsight Visual Studio Tools view Hive jobs

Faster path Hive execution via HiveServer2

Note

This feature only works on HDInsight cluster version 3.2 and newer.

The Data Lake Tools used to submit Hive jobs via WebHCat (also known as Templeton). It took a long time to return job details and error information. In order to solve this performance issue, the Data Lake Tools executes Hive jobs directly in the cluster through HiveServer2, so that it bypasses RDP/SSH. In addition to better performance, users can also view Hive on Tez graphs, and the Task details.

For HDInsight cluster version 3.2 or later, you can see an Execute via HiveServer2 button:

Data Lake visual studio Tools execute via hiveserver2

And you can see the logs streamed back in real time and see the job graphs if the Hive query is executed in Tez.

Data Lake visual studio Tools fast path hive execution

Difference between executing queries via HiveServer2 and Submitting Queries via WebHCat

Even though executing queries via HiveServer2 has many performance benefits, it has several limitations. Some of the limitations are not suitable for production usage. The following table shows the differences:

Executing via HiveServer2 Submitting via WebHCat
Execute queries Eliminates the overhead in WebHCat (which launches a MapReduce Job named “TempletonControllerJob”). As long as a query is executed via WebHCat, WebHCat launches a MapReduce job, which introduces additional latency.
Stream logs back In near real time. The job execution logs are available only when the job is finished.
View job history If a query is executed via HiveServer2, its job history (job log, job output) is not preserved. The application can be viewed in YARN UI with limited information. If a query is executed via WebHCat, its job history (job log, job output) is preserved and can be viewed using Visual Studio/HDInsight SDK/PowerShell.
Close window Executing via HiveServer2 is a “synchronous” way so you must keep the windows open; if the windows are closed then the query execution is canceled. Submitting via WebHCat is a “asynchronous” way so you can submit the query via WebHCat and close Visual Studio. You can come back and see the results at any time.

Tez Hive job performance graph

The Data Lake Tools support showing performance graphs for the Hive jobs ran by the Tez execution engine. For information on enabling Tez, see use Hive in HDInsight. After you submit a Hive job in Visual Studio, Visual Studio shows you the graph when the job is completed. You might need to click the Refresh button to get the latest job status.

Note

This feature is only available for HDInsight cluster version above 3.2.4.593, and can only work for completed jobs (if you submitted your job through WebHCat; this graph shows when you execute your query through HiveServer2).

hadoop hive tez performance graph

To help you understand your Hive query better, the tools add the Hive Operator view in this release. You just need to double-click on the vertices of the job graph and you can see all the operators inside the vertex. You can also hover on a particular operator to view more details of this operator.

Task execution view for Hive on Tez jobs

The Task execution view for Hive on Tez jobs can be used to get structured & visualized information for Hive jobs, and get more job details. When there are performance issues, you can use the view to get further details. For example, how each task operates and the detailed information about each task (data read/write, schedule/start/end time, etc.), so that you can tune job configurations or system architecture based on the visualized information.

Data Lake Visual Studio Tools task execution view

Run Hive batch jobs

Apache Hive is a data warehouse infrastructure built on Hadoop for providing data summarization, queries, and analysis. Data Lake Tools for Visual Studio supports running Hive queries from Visual Studio. For more information about Hive, see Use Hive with HDInsight.

It is time consuming to test Hive script against a HDInsight cluster, with the exception of the Interactive Query cluster. It could take several minutes or more. Data Lake Tools for Visual Studio is capable of validating Hive script locally without connecting to a live cluster. For more information about running interactive queries, see Run interactive Hive queries.

Data Lake Tools for Visual Studio also enables users to see what’s inside the Hive job by collecting and surfacing the YARN logs of certain Hive jobs.

To learn more about how to run Hive batch jobs, see the Run interactive Hive queries section. The information in that section applies to running Hive Batch jobs with longer time.

Run Pig scripts

Data Lake Tools for Visual Studio supports creating and submit Pig scripts to HDInsight clusters. Users can create a Pig project from template, and then submit the script to HDInsight clusters.

Feedbacks & Known issues

  • Currently HiveServer2 results are displayed in pure text fashion, which is not ideal. Microsoft is working on fixing that.
  • If the results are started with NULL values, currently the results are not shown. We have fixed this issue and if you are blocked on this issue, contact the support team.
  • The HQL script created by Visual Studio is encoded depending on user’s local region setting. It may not execute correctly if user uploads the script to cluster as binary.

Next steps

In this article, you learned how to connect to HDInsight clusters from Visual Studio, using the Data Lake (HDInsight) Tools package, and how to run a Hive query. For more information, see: