Use MapReduce with Apache Hadoop on HDInsight with SSH
Learn how to submit MapReduce jobs from a Secure Shell (SSH) connection to HDInsight.
If you are already familiar with using Linux-based Apache Hadoop servers, but you are new to HDInsight, see Linux-based HDInsight tips.
Prerequisites
A Linux-based HDInsight (Hadoop on HDInsight) cluster.
An SSH client. For more information, see Use SSH with HDInsight.
Connect with SSH
Connect to the cluster using SSH. For example, the following command connects to a cluster named myhdinsight as the sshuser account:
ssh sshuser@myhdinsight-ssh.azurehdinsight.net
If you use a certificate key for SSH authentication, you may need to specify the location of the private key on your client system, for example:
ssh -i ~/mykey.key sshuser@myhdinsight-ssh.azurehdinsight.net
If you use a password for SSH authentication, you need to provide the password when prompted.
For more information on using SSH with HDInsight, see Use SSH with HDInsight.
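The two connection styles described above (password and private key) can be wrapped in a small shell function on the client system. This is only a sketch: the function name hdi_ssh is hypothetical, and the host name pattern CLUSTERNAME-ssh.azurehdinsight.net is an assumption that you should check against the SSH endpoint shown for your own cluster.

```shell
# hdi_ssh USER CLUSTER [KEYFILE]
# Hypothetical helper (not part of HDInsight): opens an SSH session to a cluster.
hdi_ssh() {
    user=$1
    cluster=$2
    keyfile=${3:-}

    # Assumption: the cluster's SSH endpoint follows this naming pattern.
    host="${cluster}-ssh.azurehdinsight.net"

    if [ -n "$keyfile" ]; then
        # Key-based authentication: pass the private key location explicitly.
        ssh -i "$keyfile" "${user}@${host}"
    else
        # Password authentication: ssh prompts for the password.
        ssh "${user}@${host}"
    fi
}
```

For example, hdi_ssh sshuser myhdinsight connects with a password prompt, and hdi_ssh sshuser myhdinsight ~/mykey.key connects with a private key.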
Use Hadoop commands
After you are connected to the HDInsight cluster, use the following command to start a MapReduce job:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /example/data/WordCountOutput
This command starts the wordcount class, which is contained in the hadoop-mapreduce-examples.jar file. It uses the /example/data/gutenberg/davinci.txt document as input, and output is stored at /example/data/WordCountOutput.
For more information about this MapReduce job and the example data, see Use MapReduce in Apache Hadoop on HDInsight.
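A MapReduce job normally fails at submission if its output directory already exists, so rerunning the wordcount command with the same output path needs a cleanup step first. The following sketch wraps both steps in one function. The name run_wordcount is hypothetical, and the helper deletes the output path recursively, so use it only on directories you are willing to lose.

```shell
# Path to the examples jar, as used in the wordcount command in this article.
EXAMPLES_JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar

# run_wordcount INPUT OUTPUT
# Hypothetical helper: clears a previous output directory, then submits the job.
run_wordcount() {
    # -r removes the directory recursively; -f keeps the command quiet
    # (and successful) when the path does not exist yet.
    hdfs dfs -rm -r -f "$2" || return 1

    # Submit the wordcount example with the given input and output paths.
    yarn jar "$EXAMPLES_JAR" wordcount "$1" "$2"
}
```

On the cluster, run_wordcount /example/data/gutenberg/davinci.txt /example/data/WordCountOutput repeats the job from this article without a manual cleanup.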
The job emits details as it processes, and it returns information similar to the following text when the job completes:
File Input Format Counters
        Bytes Read=1395666
File Output Format Counters
        Bytes Written=337623
When the job completes, use the following command to list the output files:
hdfs dfs -ls /example/data/WordCountOutput
This command displays two files, _SUCCESS and part-r-00000. The part-r-00000 file contains the output for this job.
Some MapReduce jobs split the results across multiple part-r-##### files. If so, the ##### suffix indicates the order of the files.
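When a job does write several part-r-##### files, a glob pattern can read them in one command instead of one file at a time. The sketch below assumes the Hadoop file system shell expands the pattern itself and returns matches in name order. The function name cat_output_parts is hypothetical.

```shell
# cat_output_parts DIR
# Hypothetical helper: prints the contents of every reducer output file in DIR.
cat_output_parts() {
    # The pattern is quoted so the local shell passes it through unchanged;
    # the Hadoop file system shell then matches it against cluster storage.
    hdfs dfs -cat "$1/part-r-*"
}
```

For the job in this article, cat_output_parts /example/data/WordCountOutput prints the same text as reading part-r-00000 directly, because there is only one output file.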
To view the output, use the following command:
hdfs dfs -cat /example/data/WordCountOutput/part-r-00000
This command displays a list of the words that are contained in the /example/data/gutenberg/davinci.txt file and the number of times each word occurs. The following text is an example of the data that is contained in the output file:
wreathed        3
wreathing       1
wreaths         1
wrecked         3
wrenching       1
wretched        6
wriggling       1
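The output is ordered by word, so finding the most frequent words takes one more step. The following sketch sorts the lines by count. It assumes each line has the form word, tab, count, which is the usual text output layout for this example. The function name top_words is hypothetical.

```shell
# top_words [N]
# Hypothetical helper: reads "word<TAB>count" lines on standard input and
# prints the N lines with the highest counts (default 10).
top_words() {
    tab=$(printf '\t')
    # Sort by the second field as a number, descending; break ties by word.
    sort -t "$tab" -k2,2nr -k1,1 | head -n "${1:-10}"
}
```

On the cluster, it can be combined with the command shown earlier: hdfs dfs -cat /example/data/WordCountOutput/part-r-00000 | top_words 10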
As you can see, Hadoop commands provide an easy way to run MapReduce jobs in an HDInsight cluster and then view the job output.
For general information about MapReduce jobs in HDInsight, see Use MapReduce in Apache Hadoop on HDInsight.