Oink: Improving #Pig Development

Over the last couple (ok more than a couple) of months, we've taken a meandering stroll through the different parts and pieces that form the foundation of the Hadoop ecosystem. We've covered Hive, Mahout, Recommendation Engines and even a little bit about Pig. In this post we are going to again circle back to Pig and instead of focusing on the sexy, whiz-bang features we are going to look more at the practical everyday skills needed to work productively with Pig.

Pig Scripts

After you understand the basics of the Pig data model and you're comfortable with Pig Latin (the Pig scripting language), you will inevitably author one or more scripts to process and transform the data stored within your Hadoop cluster. While it's possible to run these scripts interactively from the Grunt shell, more commonly they will be saved as files and run directly.

These files, called Pig scripts, are typically named with a .pig extension and can be executed using the following syntax:

 pig <FILENAME>.pig

To play along with this post, you will need to download the sample New York Stock Exchange data from Alan Gates's Programming Pig book, found on GitHub (HERE). Copy the Pig Latin script below into a file and name it demo.pig. This file can be executed using the above syntax and will dump each stock's maximum high and minimum low price for December 17, 2009.

 daily = load '/blog/example/data/NYSE_daily' as 
(
exchange:chararray, 
symbol:chararray, 
date:chararray, 
open:float, 
high:float, 
low:float, 
close:float, 
volume:int, 
adj_close:float); 

day = filter daily by date == '2009-12-17'; 
grpd = group day by symbol; 
minmax = foreach grpd generate flatten(day.symbol), MAX(day.high), MIN(day.low);

dump minmax;

Using this as our base, we will explore how parameters and macros work and how each will simplify your day-to-day work within Pig.

Parameters

The script we used above, while useful within the context of this post, has one potentially critical flaw in terms of reusability. The hard-coded file path (/blog/example/data/NYSE_daily) and date (2009-12-17) mean that every time this script is processed, the same data will be returned. While this may be desirable behavior, in my data processing experience what we need is something more flexible. This is where parameters, or more accurately parameter substitution, come in.

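Any Pig Latin script can be parameterized by substituting the $PARAM_NAME syntax for the desired value. To parameterize the sample you can simply replace:

  • the path /blog/example/data/NYSE_daily in the load statement with $PATH
  • the date 2009-12-17 in the filter clause with $DATE

Keep the surrounding single quotes in both places. The substitution is purely textual, so the script still needs to contain a quoted string after the value is dropped in.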
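As a minimal sketch of the result, here is demo.pig with both substitutions applied. Only the load path and the filter date differ from the original script, and the schema is written on fewer lines purely for brevity.

```pig
-- demo.pig, parameterized: PATH and DATE are supplied when the script is run
daily = load '$PATH' as (exchange:chararray, symbol:chararray, date:chararray,
    open:float, high:float, low:float, close:float, volume:int, adj_close:float);

-- the quotes stay in the script; only the text between them is substituted
day    = filter daily by date == '$DATE';
grpd   = group day by symbol;
minmax = foreach grpd generate flatten(day.symbol), MAX(day.high), MIN(day.low);

dump minmax;
```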

By default these parameters are required, meaning that you must specify them for your script to run. If you attempt to run the script without them, an error message will result.

When you are ready to process your script, you can specify the parameters either inline or using a parameter file. Parameters can be passed inline by using the -p switch. Using our sample from above, this command would look like:

 pig -p PATH=/blog/example/data/NYSE_daily -p DATE=2009-12-17 demo.pig

As of the writing of this post, the inline method discussed above does not work on HDInsight but works fine in the Hortonworks and Cloudera sandboxes. A second approach, and one that does work on HDInsight, is to use a parameter file. The parameter file can have any extension and simply contains the parameters and their values, as seen below:

 #Param File
PATH = /blog/example/data/NYSE_daily
DATE=2009-12-17

This file is then included on the pig command using an -m or -param_file switch as used below:

 pig -m params.txt -f demo.pig

Occasionally, it will be necessary for you to either declare a parameter within a script or set a parameter default. Pig has facilities to handle both of these scenarios using %declare and %default. To illustrate this, we could add the following default path to our demo script (above the parameter usage):

 %default PATH '/blog/example/data/NYSE_daily';

Now the PATH parameter functions as an optional parameter: when a value is passed in it will be used; otherwise the default value provided within the script is used in its place.
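To make that concrete, here is a small sketch in which both parameters are optional. The DATE default is an assumption added for illustration and is not part of the original script. The final dump of the filtered rows is also just a stand-in for the rest of demo.pig.

```pig
-- Fallback values, used only when the caller does not supply one
%default PATH '/blog/example/data/NYSE_daily';
%default DATE '2009-12-17';   -- illustrative default, not in the original script

-- A value can also be set inside the script with %declare, for example:
--   %declare DATE '2009-12-17';

daily = load '$PATH' as (exchange:chararray, symbol:chararray, date:chararray,
    open:float, high:float, low:float, close:float, volume:int, adj_close:float);

day = filter daily by date == '$DATE';
dump day;
```

With both defaults in place, the script should run with no parameters at all, and either value can still be supplied through -p or a parameter file.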

Macros

Where parameters allow you to make your Pig Latin more dynamic, macros are a feature that allows you to extract useful portions of your script for reuse while making your scripts more compact and manageable. For those coming from a development background, macros somewhat align with the C# or Java concept of a function.

Like functions, macros are structured around a definition that includes a macro name, the input and any parameters, and a returns clause naming either an output relation or void. The basic structure of a macro is as follows:

 define <NAME> (<INPUT>, <PARAMS.....>)
returns <OUTPUT>
{
    .....Pig Latin.....
};

After being defined, the macro can subsequently be called in a script. Before looking at an example, there are a couple of limitations to be aware of. First, the parameter substitution discussed previously is not supported inside a macro. Any parameters required by the Pig Latin within the macro must be defined in the macro definition and passed in as arguments. This isn't so much a limitation once you are aware of it and architect your script and macro accordingly.

Second, unlike functions within high-level languages, there is no support for recursion within macros. This is relatively straightforward: a macro cannot call itself or be called in a circular reference.

Knowing these two things does not in any way detract from a macro's usefulness. To better illustrate, let's transform our initial example into a macro using the code below. Note that for this demo, the macro has been defined in a separate file named macro.pig.

 define daily_stock_stats (daily, DATE)
returns results
{
    day = filter $daily by date == '$DATE'; 
    grpd = group day by symbol; 
    $results = foreach grpd generate flatten(day.symbol), MAX(day.high), MIN(day.low);
};

A few things are worth pointing out in the preceding script. First, we are still using parameters. Rather than parameter substitution, however, the parameter value is passed into the macro as an argument. Also, the input (daily) and output (results) are both referenced using the same format as parameters, using $ notation.

With this macro defined, we can now use it directly in a script. First, we must reference the macro file using the import command, after which the macro can be called inline by name as seen below:

 import 'macro.pig';

daily = load '$PATH' as 
(
exchange:chararray, 
symbol:chararray, 
date:chararray, 
open:float, 
high:float, 
low:float, 
close:float, 
volume:int, 
adj_close:float); 

results = daily_stock_stats (daily, '$DATE');

dump results;

This example points out that even though parameter substitution does not work within a macro, you can easily work around the limitation with a little planning. This script can be executed as above, using the same params.txt file.
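To show the reuse side of macros, here is a hedged sketch that calls daily_stock_stats twice against the same input. The second date (2009-12-18) and the relation names dec17 and dec18 are assumptions chosen for illustration. The sketch also assumes macro.pig sits alongside the script, as in the example above.

```pig
import 'macro.pig';

daily = load '$PATH' as (exchange:chararray, symbol:chararray, date:chararray,
    open:float, high:float, low:float, close:float, volume:int, adj_close:float);

-- one macro definition, two calls with different date arguments
dec17 = daily_stock_stats(daily, '2009-12-17');
dec18 = daily_stock_stats(daily, '2009-12-18');   -- illustrative second date

dump dec17;
dump dec18;
```

Because the dates are passed as macro arguments, only PATH needs to come from the command line or the parameter file here.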

Wrap-Up

In this post, we looked at parameters and macros as a means of improving your day-to-day development experience. Both features will make your scripts more dynamic, more usable and certainly easier to manage and maintain. If you are interested in learning more in-depth details about these features, you should definitely check out Programming Pig by Alan Gates (on Amazon).

Till next time!

Chris