Use MapReduce in Hadoop on HDInsight

Learn how to run MapReduce jobs on HDInsight clusters. Use the following table to discover the various ways that MapReduce can be used with HDInsight:

Use this... ...to do this ...with this cluster operating system ...from this client operating system
SSH Use the Hadoop command through SSH Linux Linux, Unix, Mac OS X, or Windows
REST Submit the job remotely by using REST (examples use cURL) Linux or Windows Linux, Unix, Mac OS X, or Windows
Windows PowerShell Submit the job remotely by using Windows PowerShell Linux or Windows Windows
Remote Desktop (HDInsight 3.2 and 3.3) Use the Hadoop command through Remote Desktop Windows Windows

Important

Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

What is MapReduce

Hadoop MapReduce is a software framework for writing jobs that process vast amounts of data. Input data is split into independent chunks. Each chunk is processed in parallel across the nodes in your cluster. A MapReduce job consists of two functions:

  • Mapper: Consumes input data, analyzes it (usually with filter and sorting operations), and emits tuples (key-value pairs)

  • Reducer: Consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data

A basic word count MapReduce job example is illustrated in the following diagram:

HDI.WordCountDiagram

The output of this job is a count of how many times each word occurred in the text.

  • The mapper takes each line from the input text as an input and breaks it into words. It emits a key/value pair each time a word occurs of the word is followed by a 1. The output is sorted before sending it to reducer.
  • The reducer sums these individual counts for each word and emits a single key/value pair that contains the word followed by the sum of its occurrences.

MapReduce can be implemented in various languages. Java is the most common implementation, and is used for demonstration purposes in this document.

Development languages

Languages or frameworks that are based on Java and the Java Virtual Machine can be ran directly as a MapReduce job. The example used in this document is a Java MapReduce application. Non-Java languages, such as C#, Python, or standalone executables, must use Hadoop streaming.

Hadoop streaming communicates with the mapper and reducer over STDIN and STDOUT. The mapper and reducer read data a line at a time from STDIN, and write the output to STDOUT. Each line read or emitted by the mapper and reducer must be in the format of a key/value pair, delimited by a tab character:

[key]/t[value]

For more information, see Hadoop Streaming.

For examples of using Hadoop streaming with HDInsight, see the following documents:

Example data

HDInsight provides various example data sets, which are stored in the /example/data and /HdiSamples directory. These directories are in the default storage for your cluster. In this document, we use the /example/data/gutenberg/davinci.txt file. This file contains the notebooks of Leonardo Da Vinci.

Example MapReduce

An example MapReduce word count application is included with your HDInsight cluster. This example is located at /example/jars/hadoop-mapreduce-examples.jar on the default storage for your cluster.

The following Java code is the source of the MapReduce application contained in the hadoop-mapreduce-examples.jar file:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
        }
    }
    }

    public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                        Context context
                        ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
        sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
    }

    public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For instructions to write your own MapReduce applications, see the following documents:

Run the MapReduce

HDInsight can run HiveQL jobs by using various methods. Use the following table to decide which method is right for you, then follow the link for a walkthrough.

Use this... ...to do this ...with this cluster operating system ...from this client operating system
SSH Use the Hadoop command through SSH Linux Linux, Unix, Mac OS X, or Windows
Curl Submit the job remotely by using REST Linux or Windows Linux, Unix, Mac OS X, or Windows
Windows PowerShell Submit the job remotely by using Windows PowerShell Linux or Windows Windows
Remote Desktop (HDInsight 3.2 and 3.3) Use the Hadoop command through Remote Desktop Windows Windows

Important

Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

Next steps

To learn more about working with data in HDInsight, see the following documents: