Run Hive queries using PowerShell

This document provides an example of using Azure PowerShell in the Azure Resource Group mode to run Hive queries in a Hadoop on HDInsight cluster.

Note

This document does not provide a detailed description of what the HiveQL statements that are used in the examples do. For information on the HiveQL that is used in this example, see Use Hive with Hadoop on HDInsight.

Prerequisites

  • An Azure HDInsight cluster: It does not matter whether the cluster is Windows or Linux-based.

    Important

    Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

  • A workstation with Azure PowerShell.

Important

Azure PowerShell support for managing HDInsight resources using Azure Service Manager is deprecated, and was removed on January 1, 2017. The steps in this document use the new HDInsight cmdlets that work with Azure Resource Manager.

Please follow the steps in Install and configure Azure PowerShell to install the latest version of Azure PowerShell. If you have scripts that need to be modified to use the new cmdlets that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.

Run Hive queries using Azure PowerShell

Azure PowerShell provides cmdlets that allow you to remotely run Hive queries on HDInsight. Internally, the cmdlets make REST calls to WebHCat on the HDInsight cluster.

The following cmdlets are used when running Hive queries in a remote HDInsight cluster:

  • Add-AzureRmAccount: Authenticates Azure PowerShell to your Azure subscription
  • New-AzureRmHDInsightHiveJobDefinition: Creates a job definition by using the specified HiveQL statements
  • Start-AzureRmHDInsightJob: Sends the job definition to HDInsight, starts the job, and returns a job object that can be used to check the status of the job
  • Wait-AzureRmHDInsightJob: Uses the job object to check the status of the job. It waits until the job completes or the wait time is exceeded.
  • Get-AzureRmHDInsightJobOutput: Used to retrieve the output of the job
  • Invoke-AzureRmHDInsightHiveJob: Used to run HiveQL statements. This cmdlet blocks the query completes, then returns the results
  • Use-AzureRmHDInsightCluster: Sets the current cluster to use for the Invoke-AzureRmHDInsightHiveJob command

The following steps demonstrate how to use these cmdlets to run a job in your HDInsight cluster:

  1. Using an editor, save the following code as hivejob.ps1.

    # Login to your Azure subscription
    # Is there an active Azure subscription?
    $sub = Get-AzureRmSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Add-AzureRmAccount
    }
    
    #Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds=Get-Credential -Message "Enter the login for the cluster"
    
    #HiveQL
    #Note: set hive.execution.engine=tez; is not required for
    #      Linux-based HDInsight
    $queryString = "set hive.execution.engine=tez;" +
                "DROP TABLE log4jLogs;" +
                "CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'wasbs:///example/data/';" +
                "SELECT * FROM log4jLogs WHERE t4 = '[ERROR]';"
    
    #Create an HDInsight Hive job definition
    $hiveJobDefinition = New-AzureRmHDInsightHiveJobDefinition -Query $queryString 
    
    #Submit the job to the cluster
    Write-Host "Start the Hive job..." -ForegroundColor Green
    
    $hiveJob = Start-AzureRmHDInsightJob -ClusterName $clusterName -JobDefinition $hiveJobDefinition -ClusterCredential $creds
    
    #Wait for the Hive job to complete
    Write-Host "Wait for the job to complete..." -ForegroundColor Green
    Wait-AzureRmHDInsightJob -ClusterName $clusterName -JobId $hiveJob.JobId -ClusterCredential $creds
    
    # Print the output
    Write-Host "Display the standard output..." -ForegroundColor Green
    Get-AzureRmHDInsightJobOutput `
        -Clustername $clusterName `
        -JobId $hiveJob.JobId `
        -HttpCredential $creds
    
  2. Open a new Azure PowerShell command prompt. Change directories to the location of the hivejob.ps1 file, then use the following command to run the script:

     .\hivejob.ps1
    

    When the script runs, you are prompted to enter the cluster name and the HTTPS/Admin account credentials for the cluster. You may also be prompted to log in to your Azure subscription.

  3. When the job completes, it returns information similar to the following thext:

     Display the standard output...
     2012-02-03      18:35:34        SampleClass0    [ERROR] incorrect       id
     2012-02-03      18:55:54        SampleClass1    [ERROR] incorrect       id
     2012-02-03      19:25:27        SampleClass4    [ERROR] incorrect       id
    
  4. As mentioned earlier, Invoke-Hive can be used to run a query and wait for the response. Use the following script to see how Invoke-Hive works:

    # Login to your Azure subscription
    # Is there an active Azure subscription?
    $sub = Get-AzureRmSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Add-AzureRmAccount
    }
    
    #Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds=Get-Credential -Message "Enter the login for the cluster"
    
    # Set the cluster to use
    Use-AzureRmHDInsightCluster -ClusterName $clusterName -HttpCredential $creds
    
    $queryString = "set hive.execution.engine=tez;" +
                "DROP TABLE log4jLogs;" +
                "CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/example/data/';" +
                "SELECT * FROM log4jLogs WHERE t4 = '[ERROR]';"
    Invoke-AzureRmHDInsightHiveJob `
        -StatusFolder "statusout" `
        -Query $queryString
    

    The output looks like the following text:

     2012-02-03    18:35:34    SampleClass0    [ERROR]    incorrect    id
     2012-02-03    18:55:54    SampleClass1    [ERROR]    incorrect    id
     2012-02-03    19:25:27    SampleClass4    [ERROR]    incorrect    id
    
    Note

    For longer HiveQL queries, you can use the Azure PowerShell Here-Strings cmdlet or HiveQL script files. The following snippet shows how to use the Invoke-Hive cmdlet to run a HiveQL script file. The HiveQL script file must be uploaded to wasbs://.

    Invoke-AzureRmHDInsightHiveJob -File "wasbs://<ContainerName>@<StorageAccountName>/<Path>/query.hql"

    For more information about Here-Strings, see Using Windows PowerShell Here-Strings.

Troubleshooting

If no information is returned when the job completes, an error may have occurred during processing. To view error information for this job, add the following to the end of the hivejob.ps1 file, save it, and then run it again.

# Print the output of the Hive job.
Get-AzureRmHDInsightJobOutput `
        -Clustername $clusterName `
        -JobId $job.JobId `
        -HttpCredential $creds `
        -DisplayOutputType StandardError

This cmdlet returns the information that is written to STDERR on the server when you ran the job.

Summary

As you can see, Azure PowerShell provides an easy way to run Hive queries in an HDInsight cluster, monitor the job status, and retrieve the output.

Next steps

For general information about Hive in HDInsight:

For information about other ways you can work with Hadoop on HDInsight: