Run MapReduce jobs with Hadoop on HDInsight using PowerShell

This document provides an example of using Azure PowerShell to run a MapReduce job in a Hadoop on HDInsight cluster.

Prerequisites

  • An Azure HDInsight (Hadoop on HDInsight) cluster

    Important

    Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

  • A workstation with Azure PowerShell.

Run a MapReduce job using Azure PowerShell

Azure PowerShell provides cmdlets that allow you to remotely run MapReduce jobs on HDInsight. Internally, PowerShell makes REST calls to WebHCat (formerly called Templeton) running on the HDInsight cluster.

The following cmdlets are used when running MapReduce jobs in a remote HDInsight cluster.

  • Login-AzureRmAccount: Authenticates Azure PowerShell to your Azure subscription.

  • New-AzureRmHDInsightMapReduceJobDefinition: Creates a new job definition by using the specified MapReduce information.

  • Start-AzureRmHDInsightJob: Sends the job definition to HDInsight and starts the job. A job object is returned.

  • Wait-AzureRmHDInsightJob: Uses the job object to check the status of the job. It waits until the job completes or the wait time is exceeded.

  • Get-AzureRmHDInsightJobOutput: Used to retrieve the output of the job.

The following steps demonstrate how to use these cmdlets to run a job in your HDInsight cluster.

  1. Using an editor, save the following code as mapreducejob.ps1.

    # Login to your Azure subscription
    # Is there an active Azure subscription?
    $sub = Get-AzureRmSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Add-AzureRmAccount
    }
    
    # Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds=Get-Credential -Message "Enter the login for the cluster"
    
    #Get the cluster info so we can get the resource group, storage, etc.
    $clusterInfo = Get-AzureRmHDInsightCluster -ClusterName $clusterName
    $resourceGroup = $clusterInfo.ResourceGroup
    $storageAccountName=$clusterInfo.DefaultStorageAccount.split('.')[0]
    $container=$clusterInfo.DefaultStorageContainer
    #NOTE: This assumes that the storage account is in the same resource
    #      group as the cluster. If it is not, change the
    #      --ResourceGroupName parameter to the group that contains storage.
    $storageAccountKey=(Get-AzureRmStorageAccountKey `
        -Name $storageAccountName `
    -ResourceGroupName $resourceGroup)[0].Value
    
    #Create a storage context
    $context = New-AzureStorageContext `
        -StorageAccountName $storageAccountName `
        -StorageAccountKey $storageAccountKey
    
    #Define the MapReduce job
    #NOTE: If using an HDInsight 2.0 cluster, use hadoop-examples.jar instead.
    # -JarFile = the JAR containing the MapReduce application
    # -ClassName = the class of the application
    # -Arguments = The input file, and the output directory
    $wordCountJobDefinition = New-AzureRmHDInsightMapReduceJobDefinition `
        -JarFile "/example/jars/hadoop-mapreduce-examples.jar" `
        -ClassName "wordcount" `
        -Arguments `
            "/example/data/gutenberg/davinci.txt", `
            "/example/data/WordCountOutput"
    
    #Submit the job to the cluster
    Write-Host "Start the MapReduce job..." -ForegroundColor Green
    $wordCountJob = Start-AzureRmHDInsightJob `
        -ClusterName $clusterName `
        -JobDefinition $wordCountJobDefinition `
        -HttpCredential $creds
    
    #Wait for the job to complete
    Write-Host "Wait for the job to complete..." -ForegroundColor Green
    Wait-AzureRmHDInsightJob `
        -ClusterName $clusterName `
        -JobId $wordCountJob.JobId `
        -HttpCredential $creds
    # Download the output
    Get-AzureStorageBlobContent `
        -Blob 'example/data/WordCountOutput/part-r-00000' `
        -Container $container `
        -Destination output.txt `
        -Context $context
    # Print the output of the job.
    Get-AzureRmHDInsightJobOutput `
        -Clustername $clusterName `
        -JobId $wordCountJob.JobId `
        -HttpCredential $creds
    
  2. Open a new Azure PowerShell command prompt. Change directories to the location of the mapreducejob.ps1 file, then use the following command to run the script:

     .\mapreducejob.ps1
    

    When you run the script, you are prompted for the name of the HDInsight cluster and the cluster login. You may also be prompted to authenticate to your Azure subscription.

  3. When the job completes, you receive output similar to the following text:

     Cluster         : CLUSTERNAME
     ExitCode        : 0
     Name            : wordcount
     PercentComplete : map 100% reduce 100%
     Query           :
     State           : Completed
     StatusDirectory : f1ed2028-afe8-402f-a24b-13cc17858097
     SubmissionTime  : 12/5/2014 8:34:09 PM
     JobId           : job_1415949758166_0071
    

    This output indicates that the job completed successfully.

    Note

    If the ExitCode is a value other than 0, see Troubleshooting.

    This example also stores the downloaded files to an output.txt file in the directory that you run the script from.

View output

To see the words and counts produced by the job, open the output.txt file in a text editor.

Note

The output files of a MapReduce job are immutable. So if you rerun this sample, you need to change the name of the output file.

Troubleshooting

If no information is returned when the job completes, view errors for the job. To view error information for this job, add the following command to the end of the mapreducejob.ps1 file, save it, and then run it again.

# Print the output of the WordCount job.
Write-Host "Display the standard output ..." -ForegroundColor Green
Get-AzureRmHDInsightJobOutput `
        -Clustername $clusterName `
        -JobId $wordCountJob.JobId `
        -HttpCredential $creds `
        -DisplayOutputType StandardError

This cmdlet returns the information that was written to STDERR as the job runs.

Summary

As you can see, Azure PowerShell provides an easy way to run MapReduce jobs on an HDInsight cluster, monitor the job status, and retrieve the output.

Next steps

For general information about MapReduce jobs in HDInsight:

For information about other ways you can work with Hadoop on HDInsight: