Created on 08-16-2016 10:03 PM
We know that parameter passing is valuable for pig script reuse. A lesser-known point is that parameters do not simply pass variables to pig scripts; more fundamentally, they pass text that replaces placeholders in the script. This is a subtle but powerful difference: it means we can dynamically pass code alternatives to the script at run-time. This allows us to build a script with the same larger purpose but whose logic, settings, schema, storage types, UDFs, and so on can be swapped at runtime. The result is significantly fewer, yet more flexible, scripts to build and maintain across your projects, group, or organization.
In this article I will show techniques to leverage this approach through a few script examples. Keep in mind the goal is not the final examples themselves, but rather the possibilities for your own scripting.
Key ideas and techniques are demonstrated in each example below.
In this example we want to load a dataset and prepend a first column that is a key built from one or more of the original columns. If the first column is a composite key, we will concatenate the values of multiple columns, separating each value with a dash.
Note: in all cases I am calling the script from a command line client. This command could be generated manually or via a program.
Let's say we want to concatenate column 1 before column 0. The script would look as follows:
A = LOAD '$SRC' USING PigStorage(',');
X = FOREACH A GENERATE CONCAT(CONCAT($1,'-'),$0), $0 ..;
STORE X INTO '$DEST';
and we trigger it with the following command
pig -param SRC=../datasets/myData.csv -param DEST=../output/myDataWithKey.csv -f keyGenerator.pig
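As a sketch with made-up data: if a row of myData.csv contained the two fields below, the CONCAT expression would prepend a composite key built from column 1, a dash, and column 0:

```
input row:   smith,jones
output row:  jones-smith,smith,jones
```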
But what if we wanted to concatenate columns 2 and 3, or columns 1, 2, and 5, or columns 5, 1, 4, and 7? By passing parameters only as variables, we would have to write a different script each time with different CONCAT logic, and then, annoyingly, give each script a similar but still different name.
Alternatively, we can maintain one script template and pass the CONCAT logic via the parameter.
The script would look like this:
A = LOAD '$SRC' USING PigStorage(',');
X = FOREACH A GENERATE $CON;
STORE X INTO '$DEST';
and we could call the script using any number of CONCAT logic possibilities, such as the following (I am showing only the new parameter here):
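For example, two illustrative -param values of my own (assuming the dataset has at least six columns; the backslashes keep the shell from expanding the positional references before pig sees them):

```
-param CON="CONCAT(CONCAT(\$2,'-'),\$3), \$0 .."
-param CON="CONCAT(CONCAT(CONCAT(CONCAT(\$1,'-'),\$2),'-'),\$5), \$0 .."
```

The first builds a key from columns 2 and 3; the second builds a key from columns 1, 2, and 5.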
Note that in the -param value I am defining both the CONCAT logic and which of the original fields to return.
Note also that I am wrapping the -param value in quotes to escape it from the shell.
As the owner of the CONCAT logic, you would of course also need to understand the dataset you are loading. For example, you would not want to CONCAT using a column index that does not exist in the dataset (e.g., column 7 for a dataset that has only 5 columns).
If you pass -dryrun in addition to the -param parameters, you will see the running pig script output a line like this:
2016-08-11 15:15:30,530 [main] INFO org.apache.pig.Main - Dry run completed. Substituted pig script is at keyGenerator.pig.substituted
When the script finishes, you will notice a file called keyGenerator.pig.substituted next to the actual script that ran (keyGenerator.pig). The .substituted file shows the original script with all of the parameter values inlined, as if you had hard-coded the full script. This shows the text replacement that occurs when the script is run, and it is a good development technique for seeing how your parameter values are represented in the running script.
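For the keyGenerator.pig example above, called with the SRC and DEST values shown earlier, the .substituted file would contain something like this sketch:

```
A = LOAD '../datasets/myData.csv' USING PigStorage(',');
X = FOREACH A GENERATE CONCAT(CONCAT($1,'-'),$0), $0 ..;
STORE X INTO '../output/myDataWithKey.csv';
```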
Note that -dryrun produces the file described above but does not execute the script. You could alternatively use -debug, which both produces the file and executes the script. In an operational environment this may not be valuable, because each time the same script is run it will overwrite the contents of the .substituted file it produces.
In this example we develop a reusable script to clean and normalize datasets to desired standards using a library of UDFs we built.
Using the techniques from above, our script would look like this:
REGISTER ./lib/pigUDFs.jar;
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A GENERATE $GEN;
STORE B INTO '$DEST' USING $STORAGE;
Note that in addition to source and destination paths, we are able to define LOAD details (storage type, schema) and STORE details (storage type).
I could for example run the following on the command line:
pig \
-param SRC=data.txt \
-param DEST=../output/myFile.txt \
-param LOADER="'TextLoader() AS (line:chararray)'" \
-param GEN="clean.CLEAN_CSV(line)" \
-param STORAGE="PigStorage('|')" \
-f clean.pig
or I could run this:
pig \
-param "SRC=data.txt" \
-param "DEST=../output/myOtherFile.txt" \
-param LOADER="PigStorage(',') AS (lastname:chararray, firstname:chararray, ssn:chararray, position:chararray, startdate:chararray, tenure:int, salary:double)" \
-param GEN="clean.REDACT(lastname), clean.CLEAN_TOLOWER(firstname), clean.REDACT(ssn), clean.CLEAN_TOLOWER(position), normalize.NORMALIZE_DATE(startdate), tenure, salary" \
-param STORAGE="PigStorage('|')" \
-f clean.pig
In the first instance, I use the script template with a UDF that applies a generic clean operation to each field in the entire line (the script knows the delimiter).
In the second instance I use the same script template to use different UDFs on each field, including both normalizing and cleaning. This requires knowledge of the schema, which is passed in with the LOADER parameter.
Note again the quotes to escape special characters in parameter values.
Here we have a special additional need for quotes. Pig requires that when a parameter value contains spaces, you wrap that value in single quotes. Thus, notice:
-param LOADER="'TextLoader() AS (line:chararray)'"
The double quotes are for shell escaping and spaces, and the single quotes are required by pig for spaces.
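A quick way to sanity-check these quoting layers without running pig is to echo the argument and see exactly what the shell will hand to pig (the shell strips its own double quotes and leaves the single quotes intact):

```shell
# Preview the argument after the shell removes its own (double) quotes:
echo -param LOADER="'TextLoader() AS (line:chararray)'"
# prints: -param LOADER='TextLoader() AS (line:chararray)'
```

This trick works for any of the -param values in this article and costs nothing to try before a real run.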
The above is clearly clumsy on the command line. We could put some or all of the parameters in a parameter file and identify the file using -param_file.
For the second example above, the file contents would look like:
LOADER='PigStorage(',') AS (lastname:chararray, firstname:chararray, ssn:chararray, position:chararray, startdate:chararray, tenure:int, salary:double)'
GEN='clean.REDACT(lastname), clean.CLEAN_TOLOWER(firstname), clean.REDACT(ssn), clean.CLEAN_TOLOWER(position), normalize.NORMALIZE_DATE(startdate), tenure, salary'
STORAGE=PigStorage()
Note we only need the single-quote wrappers to satisfy pig's rule about spaces in parameter values.
We would now call the script as follows:
pig -param SRC=data.txt -param DEST=../output/xform.txt -param_file thisJobParams.txt -f clean.pig
We can store optimization settings in a set of parameter files and select which we want to apply at run-time. For example, imagine the following new parameter $OPT in the script:
$OPT
REGISTER ./lib/pigUDFs.jar;
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A GENERATE $GEN;
STORE B INTO '$DEST' USING $STORAGE;
Now imagine, say, 10 files each with different optimization settings. One of the files could look like this:
OPT='\
SET opt.multiquery false; \
SET io.sort.mb 2048; \
SET mapreduce.task.timeout 1800;'
And our run of the script would be identical to the command line at the end of Technique 4, but would have this additional parameter: -param_file chosenOptParams.txt
We can thus have different parameter files that serve different purposes. In the case here, the param file is used to inline optimization settings into the rendered script.
Since your scripts will see reuse and concurrent runs, make them multi-tenant by passing in a job name and log file name.
If our script looked like this:
SET job.name $JOBNAME;
SET pig.logfile ./pigLogPath/$JOBNAME;
$OPT
REGISTER ./lib/pigUDFs.jar;
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A GENERATE $GEN;
STORE B INTO '$DEST' USING $STORAGE;
we could pass in a unique name that includes line of business or group and job name. Thus, the full command line would look something like this:
pig -param SRC=myData.txt -param DEST=myOutput.txt -param_file myLob_MyJob_Params.txt -param_file chosenOptParams.txt -param JOBNAME=myLob_myJobName -f clean.pig
Given the above pig script template, one can only imagine the diverse number of parameters that could be passed into the same pig script to load, clean/normalize (or other) and store files in an optimized and multi-tenant way.
By defining the logfile path and reusing the same job name for the same set of parameters, we get the benefit of appending failures for a given job to a single log file (as opposed to creating a new log file for each failure), and of writing it to a location other than where the script itself is located.
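Putting it all together: with a command line like the one above (and assuming JOBNAME=myLob_myJobName and the optimization file shown earlier), the top of the rendered .substituted script would look something like this sketch, followed by the substituted LOAD, GENERATE, and STORE statements:

```
SET job.name myLob_myJobName;
SET pig.logfile ./pigLogPath/myLob_myJobName;
SET opt.multiquery false;
SET io.sort.mb 2048;
SET mapreduce.task.timeout 1800;
REGISTER ./lib/pigUDFs.jar;
```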
Passing parameters as code and not simply as variables opens up a new world of flexibility and reuse for your pig scripting. Along with your imagination and experimentation, the above techniques can lead to significantly less time building and maintaining pig scripts and more time leveraging a set of templates and parameter files that you maintain. Give it a try and see where it takes you. Process more of your Big Data into business value data with more powerful scripting.