Run Pig jobs on a Linux-based cluster with the Pig command (SSH)
Learn how to interactively run Pig jobs from an SSH connection to your HDInsight cluster. The Pig Latin programming language allows you to describe transformations that are applied to the input data to produce the desired output.
The steps in this document require a Linux-based HDInsight cluster. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.
Connect with SSH
Use SSH to connect to your HDInsight cluster. The following example connects to a cluster named myhdinsight as the account named sshuser:
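The exact host name depends on your cluster; the example below assumes the default HDInsight SSH endpoint naming convention (CLUSTERNAME-ssh.azurehdinsight.net). Substitute your own cluster and account names:

```shell
ssh sshuser@myhdinsight-ssh.azurehdinsight.net
```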
For more information, see Use SSH with HDInsight.
Use the Pig command
Once connected, start the Pig command-line interface (CLI) by using the following command:
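```shell
pig
```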
After a moment, the prompt changes to grunt>.
Enter the following statement:
LOGS = LOAD '/example/data/sample.log';
This command loads the contents of the sample.log file into LOGS. You can view the contents of the file by using the following statement:
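```pig
DUMP LOGS;
```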
Next, transform the data by applying a regular expression to extract only the logging level from each record by using the following statement:
LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
You can use DUMP to view the data after the transformation. In this case, use
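```pig
DUMP LEVELS;
```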
Continue applying transformations by using the statements in the following table:
| Pig Latin statement | What the statement does |
| --- | --- |
| `FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;` | Removes rows that contain a null value for the log level and stores the results into FILTEREDLEVELS. |
| `GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;` | Groups the rows by log level and stores the results into GROUPEDLEVELS. |
| `FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;` | Creates a set of data that contains each unique log level value and how many times it occurs. The data set is stored into FREQUENCIES. |
| `RESULT = order FREQUENCIES by COUNT desc;` | Orders the log levels by count (descending) and stores the results into RESULT. |
Use DUMP to view the result of the transformation after each step.
You can also save the results of a transformation by using the STORE statement. For example, the following statement saves the contents of RESULT to the /example/data/pigout directory on the default storage for your cluster:
STORE RESULT into '/example/data/pigout';
The data is stored in the specified directory in files named part-nnnnn. If the directory already exists, you receive an error.
To exit the grunt prompt, enter the following statement:
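```pig
quit;
```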
Pig Latin batch files
You can also use the Pig command to run Pig Latin statements contained in a file. After exiting the grunt prompt, use the following command to create a file named pigbatch.pig:
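The nano editor is assumed here, since the save keystrokes described below (Ctrl + X, Y, Enter) are nano's:

```shell
nano pigbatch.pig
```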
Type or paste the following lines:
LOGS = LOAD '/example/data/sample.log';
LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
RESULT = order FREQUENCIES by COUNT desc;
DUMP RESULT;
When finished, use Ctrl + X, Y, and then Enter to save the file.
Use the following command to run the pigbatch.pig file with the Pig command:
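```shell
pig pigbatch.pig
```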
Once the batch job finishes, you see the following output:
(TRACE,816)
(DEBUG,434)
(INFO,96)
(WARN,11)
(ERROR,6)
(FATAL,2)
For general information on Pig in HDInsight, see the following document:
For more information on other ways to work with Hadoop on HDInsight, see the following documents: