
Transformation with Pig

Expert Contributor

Hi, I imported data from MS SQL Server, and I'm currently struggling to apply transformations (sum, average, ...). I want to build a method driven by parameters: the name of the database, the column(s) to transform, and the transformation operator. I'm currently using Pig, but I can't get it to work.

Pig code:

grunt> donnees = LOAD '/xxx/xxx/xxx/$input_dbb/$input_table' USING ParquetLoader;

grunt> Blg_avg = FOREACH donnees GENERATE $col1, $col2;

grunt> SET parquet.compression gzip;

grunt> STORE Blg_avg INTO '/xxx/xxx/xxx/xxx/$input_table' USING ParquetStorer;

grunt> DUMP Blg_avg;
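As a side note, Pig supports parameter substitution from the command line, which is one way to supply these values without editing the script. Assuming the statements above were saved as transform.pig (a hypothetical file name), the invocation could look like this, with placeholder values for the parameters:

```
pig -param input_dbb=mydb \
    -param input_table=mytable \
    -param col1=price \
    -param col2=quantity \
    transform.pig
```

Each `-param name=value` pair replaces the corresponding `$name` reference in the script before it runs.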


5 REPLIES

Master Guru

So, what's wrong? Do ParquetLoader and ParquetStorer work?

Expert Contributor

The difficulty is that I would like to pass a generic function (AVG, SUM, ...) as a parameter, so that when running with Oozie, the user can specify the operation (AVG, SUM, ...) to perform on the column, which will also be passed as a parameter.

Basically, I want to implement a transformation function that will be called whenever the user needs it. This function will take as parameters the table to process, the columns, and the transformation operator (AVG, SUM, ...).

Thank you !


Have you considered wrapping it in a shell script? Here is a simple example (test.sh):

#!/bin/bash

TMPFILE="/tmp/script.pig"

FUN=$1 # pass the pig function as first parameter
TAB=$2 # pass the column as second parameter

cat <<EOF > "$TMPFILE"
iris = load '/tmp/iris.data' using PigStorage(',') 
       as (sl:double, sw:double, pl:double, pw:double, species:chararray);
by_species = group iris by species;
result = foreach by_species generate group as species, $FUN(iris.$TAB);
dump result;
EOF

pig -x tez "$TMPFILE"

You can call it, e.g., as

bash ./test.sh MAX sw

to get the maximum of column "sw", or

bash ./test.sh AVG sl

to get the average of column "sl".

Expert Contributor

Thank you, Bernhard Walter, for your suggestions.

But I have a few more questions, please:

1. Given that I need to be able to load several files one after another, and their schemas will be completely different, how can I handle the columns in the 'as' clause in that case?

2. Can I replace TAB=$2 with TAB=$* in order to pass several columns as parameters?

Thank you!


Well, this now goes into bash programming: the part between the lines "cat <<EOF" and "EOF" is a so-called "here document" that writes the actual Pig script. Everything starting with $ is a variable: $0, $1, ... are predefined in bash, with $0 containing the script/program name, $1 the first actual parameter, $2 the second, and so on. $* expands to all provided parameters joined with a space by default, while "$@" expands to each parameter as a separate word. Note: setting variables (e.g. FUN, TAB) is done without $, referencing them with $.

So you can add any logic before "cat <<EOF" to set variables based on your input parameters, and then reference them in the here document to get the Pig script you want.
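To illustrate the question about passing several columns: rather than TAB=$*, one option is to shift away the function name and loop over the remaining parameters with "$@". This is a hypothetical sketch (the build_generate name and the iris/by_species aliases are illustrations borrowed from the example above); it only prints the generated Pig line:

```shell
# Build a Pig GENERATE line from a function name and any number of columns.
build_generate() {
  local fun=$1          # first parameter: the Pig function, e.g. MAX or AVG
  shift                 # drop it; "$@" now holds only the column names
  local cols=""
  local c
  for c in "$@"; do     # "$@" expands each remaining parameter as its own word
    cols="$cols, $fun(iris.$c)"
  done
  cols=${cols#, }       # strip the leading ", "
  echo "result = foreach by_species generate group as species, $cols;"
}

build_generate MAX sw pl
# prints: result = foreach by_species generate group as species, MAX(iris.sw), MAX(iris.pl);
```

The echoed line could then go into the here document in place of the fixed $FUN(iris.$TAB) term.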

For more, see e.g. http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html and http://www.tldp.org/LDP/abs/html/index.html (among many other resources).