We know that parameter passing is valuable for pig script reuse. One lesser known insight is that parameters do not simply pass variables to pig scripts but rather (and more fundamentally) they pass text that replaces placeholders in the script. This is a subtle but powerful difference: it means we can dynamically pass code alternatives to the script at run-time. This allows us to build a script with the same larger purpose but whose logic, settings, schema, storage types, UDFs, and so on can be swapped at runtime. The result is significantly fewer, yet more flexible, scripts to build and maintain across your projects, group, or organization.
In this article I will
show techniques to leverage this approach through a few script examples.
Keep in mind the goal is not the final examples themselves, but rather
the possibilities for your own scripting.
Key ideas and techniques:
- passing parameters as text
- -param, whose parameter value holds a code snippet
- -param_file, whose value names a parameter file holding one or more parameters whose values hold code snippets
- passing multiple -param and -param_file arguments to a single pig script
- the -dryrun option, which shows the inline result of the parameter substitution (useful for understanding and debugging)
In this example we want to load a dataset and insert a new first column that is a key built from one or more of the original columns. If the first column is a composite key, we will concatenate the values of multiple columns, separating each value with a dash.
Note: in all cases I am calling the script from a command line client. This command could be generated manually or via a program.
Technique 1: Simple parameter passing (values as variables)
Let's say we want to concatenate column 1 before column 0. The script would look as follows:
A = LOAD '$SRC' USING PigStorage(',');
X = FOREACH A GENERATE CONCAT(CONCAT($1,'-'),$0), $0..;
STORE X INTO '$DEST';
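A call might look like the following sketch, assuming the script is saved as keyGenerator.pig (the name used later in this article) and the paths are hypothetical placeholders:

```shell
# SRC and DEST are substituted as plain text into the script placeholders.
pig -param SRC=/data/input/records.csv \
    -param DEST=/data/output/keyed \
    -f keyGenerator.pig
```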
But what if we wanted to concatenate columns 2 and 3, or columns 1, 2, and 5, or columns 5, 1, 4, and 7? Passing parameters only as variables, we would have to write a different script each time with different CONCAT logic, and then, annoyingly, give each script a similar but still different name.
Technique 2: Passing code via parameters
Alternatively, we can maintain one script template and pass the CONCAT logic via the parameter.
The script would look like this:
A = LOAD '$SRC' USING PigStorage(',');
X = FOREACH A GENERATE $CON;
STORE X INTO '$DEST';
and we would call the script with any number of CONCAT logic possibilities, passing the logic in via the CON parameter.
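For instance, either of the following could be passed (a sketch, showing only the new parameter; the escaping with `\$` assumes a bash-like shell so that the dollar signs reach pig intact):

```shell
# Key from columns 2 and 3:
-param CON="CONCAT(CONCAT(\$2,'-'),\$3), \$0.."

# Key from columns 1, 2, and 5:
-param CON="CONCAT(CONCAT(CONCAT(CONCAT(\$1,'-'),\$2),'-'),\$3), \$0.."
```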
Note that I am defining the CONCAT logic in the -param value, as well as which of the original fields to return.
Note also that I am wrapping the -param value in quotes to escape it from the shell.
As the owner of the CONCAT logic, you would of course also need to understand the dataset you are loading. For example, you would not want to CONCAT using a column index that does not exist in the dataset (e.g., column 7 for a dataset that has only 5 columns).
Technique 3: Include -dryrun as a parameter to see the inline rendering
If you pass -dryrun in addition to the -param parameters, the running pig script will output a line like this:
2016-08-11 15:15:30,530 [main] INFO org.apache.pig.Main - Dry run completed. Substituted pig script is at keyGenerator.pig.substituted
When the script finishes, you will find a file called keyGenerator.pig.substituted next to the actual script that ran (keyGenerator.pig). The .substituted file shows the original script with all of the parameter values inlined, as if you had hard-coded the full script. This reveals the text replacement that occurs when the script is run, and it is a good development technique for seeing how your parameter values are represented in the running script.
Note that -dryrun produces the file described above but does not execute the script. You could instead use -debug, which both produces the file and executes the script. In an operational environment this may not be valuable, because each time the same script runs it overwrites the contents of the .substituted file it produces.
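Putting the pieces together, a development-time -dryrun invocation might look like this (paths and key logic are illustrative):

```shell
pig -dryrun \
    -param SRC=/data/input/records.csv \
    -param DEST=/data/output/keyed \
    -param CON="CONCAT(CONCAT(\$1,'-'),\$0), \$0.." \
    -f keyGenerator.pig
# Then inspect keyGenerator.pig.substituted to see the fully inlined script.
```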
In this example we develop a reusable script to clean and normalize datasets to desired standards using a library of UDFs we built.
Using the techniques from above, our script would look like this:
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A GENERATE $GEN;
STORE B INTO '$DEST' USING $STORAGE;
Note that in addition to the source and destination paths, we are able to define the LOAD details (storage type, schema) and the STORE details (storage type).
I could, for example, run this with quite different parameter sets from the command line.
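Two possible invocations are sketched below. The script name (cleanse.pig) and the UDF names (myudfs.CleanLine, myudfs.NormalizeName, myudfs.NormalizePhone, myudfs.Clean) are hypothetical stand-ins for your own library; the script is assumed to REGISTER the UDF jar itself.

```shell
# First instance: one generic clean UDF applied to the whole line.
pig -param SRC=/data/raw/events \
    -param DEST=/data/clean/events \
    -param LOADER="'TextLoader() AS (line:chararray)'" \
    -param GEN="myudfs.CleanLine(line)" \
    -param STORAGE="PigStorage(',')" \
    -f cleanse.pig

# Second instance: different UDFs per field; the schema arrives via LOADER.
pig -param SRC=/data/raw/users \
    -param DEST=/data/clean/users \
    -param LOADER="'PigStorage(',') AS (name:chararray, phone:chararray, zip:chararray)'" \
    -param GEN="myudfs.NormalizeName(name), myudfs.NormalizePhone(phone), myudfs.Clean(zip)" \
    -param STORAGE="PigStorage('|')" \
    -f cleanse.pig
```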
Using the same script template, in the first instance I use a UDF to apply a generic clean operation to each field in the entire line (the script knows the delimiter).
In the second instance I use the same script template with different UDFs on each field, performing both normalizing and cleaning. This requires knowledge of the schema, which is passed in with the LOADER parameter.
Note again the quotes to escape special characters in parameter values.
Here we have an additional, special need for quotes. Pig requires that when your parameter value contains spaces, you wrap that value in single quotes. Thus, notice:
-param LOADER="'TextLoader() AS (line:chararray)'"
The double quotes are for shell escaping and spaces, and the single quotes are required by pig for spaces.
Technique 4: Store parameters in parameter files and select parameter file at run-time
The above is clearly clumsy on the command line. We could instead put some or all of the parameters in a parameter file and identify that file with -param_file.
For the second example above, the parameter definitions could move into such a file.
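A parameter file for the second invocation might look like the following sketch. Pig parameter files take one name = value definition per line and allow # comments; the file name, paths, and UDF names are illustrative:

```
# cleanse.params -- per-field normalize/clean of the users dataset
SRC = /data/raw/users
DEST = /data/clean/users
LOADER = 'PigStorage(',') AS (name:chararray, phone:chararray, zip:chararray)'
GEN = myudfs.NormalizeName(name), myudfs.NormalizePhone(phone), myudfs.Clean(zip)
STORAGE = PigStorage('|')
```

The invocation then collapses to:

```shell
pig -param_file cleanse.params -f cleanse.pig
```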
Given the above pig script template, one can imagine the wide variety of parameters that could be passed into the same pig script to load, clean/normalize (or otherwise transform), and store files in an optimized and multi-tenant way.
By defining the logfile path and reusing the same log file name for the same set of parameters, we get the benefit of appending job failures to one log file (as opposed to a new log file for each failure), and also of writing it to a location other than where the script is located.
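This is done with pig's -logfile option; a sketch, with a hypothetical log path and the parameter file from above:

```shell
# Failures for this parameter set keep appending to one stable log location.
pig -logfile /var/log/pig/user-cleanse.log \
    -param_file cleanse.params \
    -f cleanse.pig
```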
Passing parameters as code, and not simply as variables, opens up a new world of flexibility and reuse for your pig scripting. Along with your imagination and experimentation, the above techniques can lead to significantly less time building and maintaining pig scripts and more time leveraging a set of templates and parameter files that you maintain. Give it a try and see where it takes you, and turn more of your Big Data into business value with more powerful scripting.