Use Hive with Hadoop on HDInsight with Remote Desktop

In this article, you will learn how to connect to an HDInsight cluster by using Remote Desktop, and then run Hive queries by using the Hive Command-Line Interface (CLI).

Important

Remote Desktop is only available on HDInsight clusters that use Windows as the operating system. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

For HDInsight 3.4 or greater, see Use Hive with HDInsight and Beeline for information on running Hive queries directly on the cluster from a command-line.

Prerequisites

To complete the steps in this article, you will need the following:

  • A Windows-based HDInsight (Hadoop on HDInsight) cluster
  • A client computer running Windows 10, Window 8, or Windows 7

Connect with Remote Desktop

Enable Remote Desktop for the HDInsight cluster, then connect to it by following the instructions at Connect to HDInsight clusters using RDP.

Use the Hive command

When you have connected to the desktop for the HDInsight cluster, use the following steps to work with Hive:

  1. From the HDInsight desktop, start the Hadoop Command Line.
  2. Enter the following command to start the Hive CLI:

     %hive_home%\bin\hive
    

    When the CLI has started, you will see the Hive CLI prompt: hive>.

  3. Using the CLI, enter the following statements to create a new table named log4jLogs using sample data:

     set hive.execution.engine=tez;
     DROP TABLE log4jLogs;
     CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
     STORED AS TEXTFILE LOCATION 'wasb:///example/data/';
     SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log' GROUP BY t4;
    

    These statements perform the following actions:

    • DROP TABLE: Deletes the table and the data file if the table already exists.
    • CREATE EXTERNAL TABLE: Creates a new 'external' table in Hive. External tables store only the table definition in Hive (the data is left in the original location).

      Note

      External tables should be used when you expect the underlying data to be updated by an external source (such as an automated data upload process) or by another MapReduce operation, but you always want Hive queries to use the latest data.

      Dropping an external table does not delete the data, only the table definition.

    • ROW FORMAT: Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.
    • STORED AS TEXTFILE LOCATION: Tells Hive where the data is stored (the example/data directory) and that it is stored as text.
    • SELECT: Selects a count of all rows where column t4 contains the value [ERROR]. This should return a value of 3 because there are three rows that contain this value.
    • INPUT__FILE__NAME LIKE '%.log' - Tells Hive that we should only return data from files ending in .log. This restricts the search to the sample.log file that contains the data, and keeps it from returning data from other example data files that do not match the schema we defined.
  4. Use the following statements to create a new 'internal' table named errorLogs:

     CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) STORED AS ORC;
     INSERT OVERWRITE TABLE errorLogs SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log';
    

    These statements perform the following actions:

    • CREATE TABLE IF NOT EXISTS: Creates a table if it does not already exist. Because the EXTERNAL keyword is not used, this is an internal table, which is stored in the Hive data warehouse and is managed completely by Hive.

      Note

      Unlike EXTERNAL tables, dropping an internal table also deletes the underlying data.

    • STORED AS ORC: Stores the data in optimized row columnar (ORC) format. This is a highly optimized and efficient format for storing Hive data.
    • INSERT OVERWRITE ... SELECT: Selects rows from the log4jLogs table that contain [ERROR], then inserts the data into the errorLogs table.

      To verify that only rows that contain [ERROR] in column t4 were stored to the errorLogs table, use the following statement to return all the rows from errorLogs:

      SELECT * from errorLogs;

      Three rows of data should be returned, all containing [ERROR] in column t4.

Summary

As you can see, the the Hive command provides an easy way to interactively run Hive queries on an HDInsight cluster, monitor the job status, and retrieve the output.

Next steps

For general information about Hive in HDInsight:

For information about other ways you can work with Hadoop on HDInsight:

If you are using Tez with Hive, see the following documents for debugging information: