Use Azure PowerShell to run Pig jobs with HDInsight

This document provides an example of using Azure PowerShell to submit Pig jobs to a Hadoop cluster on HDInsight. Pig allows you to write MapReduce jobs in a language (Pig Latin) that models data transformations, rather than as map and reduce functions.

Note

This document does not provide a detailed description of what the Pig Latin statements used in the examples do. For information about the Pig Latin used in this example, see Use Pig with Hadoop on HDInsight.

Prerequisites

  • An Azure HDInsight cluster

    Important

    Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

  • A workstation with Azure PowerShell.

Important

Azure PowerShell support for managing HDInsight resources using Azure Service Manager is deprecated, and was removed on January 1, 2017. The steps in this document use the new HDInsight cmdlets that work with Azure Resource Manager.

Please follow the steps in Install and configure Azure PowerShell to install the latest version of Azure PowerShell. If you have scripts that need to be modified to use the new cmdlets that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.

Run Pig jobs using PowerShell

Azure PowerShell provides cmdlets that allow you to remotely run Pig jobs on HDInsight. Internally, PowerShell uses REST calls to WebHCat running on the HDInsight cluster.
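
To make the REST layer concrete, the following is a minimal sketch of calling WebHCat directly from PowerShell. It assumes the cluster is reachable at https://CLUSTERNAME.azurehdinsight.net and that WebHCat is exposed under /templeton/v1, as on a standard HDInsight cluster. The function names Get-WebHCatStatusUri and Get-WebHCatStatus and the cluster name mycluster are hypothetical, introduced only for illustration. It is not how the cmdlets are implemented, only an indication of the kind of request they send.

```powershell
# Sketch: call the WebHCat (Templeton) status endpoint directly.
# Assumption: the cluster gateway is at https://<cluster name>.azurehdinsight.net.
function Get-WebHCatStatusUri {
    param([Parameter(Mandatory)][string]$ClusterName)
    "https://${ClusterName}.azurehdinsight.net/templeton/v1/status"
}

function Get-WebHCatStatus {
    param(
        [Parameter(Mandatory)][string]$ClusterName,
        [Parameter(Mandatory)][pscredential]$Credential
    )
    # The cluster login (HTTPS/Admin account) authenticates the request.
    Invoke-RestMethod `
        -Uri (Get-WebHCatStatusUri -ClusterName $ClusterName) `
        -Credential $Credential
}

# Example call (hypothetical cluster name; prompts for the cluster login):
# Get-WebHCatStatus -ClusterName 'mycluster' -Credential (Get-Credential)
```

If this request succeeds, the credentials and cluster name are valid, which helps separate connectivity problems from problems in the Pig job itself.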

The following cmdlets are used when running Pig jobs on a remote HDInsight cluster:

  • Login-AzureRmAccount: Authenticates Azure PowerShell to your Azure subscription. The script in this document calls the same cmdlet by its other name, Add-AzureRmAccount.
  • New-AzureRmHDInsightPigJobDefinition: Creates a job definition by using the specified Pig Latin statements
  • Start-AzureRmHDInsightJob: Sends the job definition to HDInsight, starts the job, and returns a job object that can be used to check the status of the job
  • Wait-AzureRmHDInsightJob: Uses the job object to check the status of the job. It waits until the job has completed, or the wait time has been exceeded.
  • Get-AzureRmHDInsightJobOutput: Retrieves the output of the job

The following steps demonstrate how to use these cmdlets to run a job on your HDInsight cluster.

  1. Using an editor, save the following code as pigjob.ps1.

    # Log in to your Azure subscription if there is no active session
    $sub = Get-AzureRmSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Add-AzureRmAccount
    }
    
    # Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds = Get-Credential -Message "Enter the login for the cluster"
    
    # Store the Pig Latin statements in $QueryString
    $QueryString =  "LOGS = LOAD '/example/data/sample.log';" +
    "LEVELS = foreach LOGS generate REGEX_EXTRACT(`$0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1)  as LOGLEVEL;" +
    "FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;" +
    "GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;" +
    "FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;" +
    "RESULT = order FREQUENCIES by COUNT desc;" +
    "DUMP RESULT;"
    
    
    # Create a new HDInsight Pig job definition
    $pigJobDefinition = New-AzureRmHDInsightPigJobDefinition `
        -Query $QueryString `
        -Arguments "-w"
    
    # Start the Pig job on the HDInsight cluster
    Write-Host "Start the Pig job ..." -ForegroundColor Green
    $pigJob = Start-AzureRmHDInsightJob `
        -ClusterName $clusterName `
        -JobDefinition $pigJobDefinition `
        -HttpCredential $creds
    
    # Wait for the Pig job to complete
    Write-Host "Wait for the Pig job to complete ..." -ForegroundColor Green
    Wait-AzureRmHDInsightJob `
        -ClusterName $clusterName `
        -JobId $pigJob.JobId `
        -HttpCredential $creds
    
    # Display the output of the Pig job.
    Write-Host "Display the standard output ..." -ForegroundColor Green
    Get-AzureRmHDInsightJobOutput `
        -ClusterName $clusterName `
        -JobId $pigJob.JobId `
        -HttpCredential $creds
    
  2. Open a new Windows PowerShell command prompt. Change directories to the location of the pigjob.ps1 file, then use the following command to run the script:

     .\pigjob.ps1
    

    You are prompted to log in to your Azure subscription. Then, you are asked for the HTTPS/Admin account name and password for the HDInsight cluster.

  3. When the job completes, it should return information similar to the following text:

     Start the Pig job ...
     Wait for the Pig job to complete ...
     Display the standard output ...
     (TRACE,816)
     (DEBUG,434)
     (INFO,96)
     (WARN,11)
     (ERROR,6)
     (FATAL,2)
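
The query in pigjob.ps1 is built by concatenating quoted strings, which requires escaping $0 with a backtick. As an alternative, the sketch below writes the same Pig Latin in a single-quoted here-string and then joins the lines into one string. The variable name $pigLatin is introduced here for illustration; the statements are the ones from the script above.

```powershell
# Sketch: build the same Pig Latin with a single-quoted here-string.
# Single quotes disable variable expansion, so $0 needs no backtick escape.
$pigLatin = @'
LOGS = LOAD '/example/data/sample.log';
LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
RESULT = order FREQUENCIES by COUNT desc;
DUMP RESULT;
'@

# Join the lines into one string, the same single-line form that
# the concatenation in pigjob.ps1 produces.
$QueryString = ($pigLatin -split "\r?\n" | ForEach-Object { $_.Trim() }) -join ''
```

The resulting $QueryString can be passed to New-AzureRmHDInsightPigJobDefinition with -Query exactly as before. The here-string form is easier to edit when the query grows, because each statement sits on its own line without quotes or plus signs.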
    

Troubleshooting

If no information is returned when the job completes, an error may have occurred during processing. To view error information for this job, add the following command to the end of the pigjob.ps1 file, save it, and then run it again.

# Display the standard error output of the Pig job.
Write-Host "Display the standard error output ..." -ForegroundColor Green
Get-AzureRmHDInsightJobOutput `
        -ClusterName $clusterName `
        -JobId $pigJob.JobId `
        -HttpCredential $creds `
        -DisplayOutputType StandardError

This returns the information that was written to STDERR on the server when you ran the job, and it may help determine why the job is failing.
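
Rather than always printing STDERR, a script can inspect the job object that Wait-AzureRmHDInsightJob returns and fetch the error output only when the job did not succeed. The following is a minimal sketch under two assumptions: that the returned object exposes State and ExitValue properties, with SUCCEEDED and 0 indicating success, and that a 600-second wait limit is acceptable. The function names Test-PigJobSucceeded and Get-PigJobErrorIfFailed and the timeout value are hypothetical choices for illustration.

```powershell
# Sketch: decide from the finished job object whether to fetch STDERR.
# Assumption: the job object exposes State and ExitValue.
function Test-PigJobSucceeded {
    param([Parameter(Mandatory)]$Job)
    ($Job.State -eq 'SUCCEEDED') -and ($Job.ExitValue -eq 0)
}

function Get-PigJobErrorIfFailed {
    param(
        [Parameter(Mandatory)][string]$ClusterName,
        [Parameter(Mandatory)][string]$JobId,
        [Parameter(Mandatory)][pscredential]$Credential,
        [int]$TimeoutInSeconds = 600   # hypothetical wait limit
    )
    # Wait for completion, or until the timeout is exceeded.
    $finished = Wait-AzureRmHDInsightJob `
        -ClusterName $ClusterName `
        -JobId $JobId `
        -HttpCredential $Credential `
        -TimeoutInSeconds $TimeoutInSeconds

    if (-not (Test-PigJobSucceeded -Job $finished)) {
        # Only a failed or unfinished job needs its STDERR.
        Get-AzureRmHDInsightJobOutput `
            -ClusterName $ClusterName `
            -JobId $JobId `
            -HttpCredential $Credential `
            -DisplayOutputType StandardError
    }
}

# Example call from pigjob.ps1, after Start-AzureRmHDInsightJob:
# Get-PigJobErrorIfFailed -ClusterName $clusterName -JobId $pigJob.JobId -Credential $creds
```

Keeping the success check in its own function makes it possible to exercise the logic with a stand-in object, without contacting a cluster.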

Summary

Azure PowerShell lets you run Pig jobs on an HDInsight cluster, monitor the job status, and retrieve the output from a single script.

Next steps

For general information about Pig in HDInsight, see Use Pig with Hadoop on HDInsight.

For information about other ways you can work with Hadoop on HDInsight: