Created 05-09-2016 09:15 AM
Hi, I imported data from MS SQL Server. Now I am struggling to apply transformations (sum, average, ...), because I would like to build a generic method driven by parameters: the database name, the column or columns to convert, and the transformation operator. I currently use Pig, but I cannot get it to do this.
pig code:
grunt > donnees = LOAD '/xxx/xxx/xxx/$input_dbb/$input_table' USING ParquetLoader;
grunt > Blg_avg = FOREACH donnees GENERATE $col1, $col2;
grunt > SET parquet.compression gzip;
grunt > STORE Blg_avg INTO '/xxx/xxx/xxx/xxx/$input_table' USING ParquetStorer;
grunt > DUMP Blg_avg;
Created 05-09-2016 02:02 PM
Have you considered wrapping it into a shell script? Here is a simple example (test.sh)
#!/bin/bash
TMPFILE="/tmp/script.pig"
FUN=$1   # pass the pig function as first parameter
TAB=$2   # pass the column as second parameter
cat <<EOF > "$TMPFILE"
iris = load '/tmp/iris.data' using PigStorage(',') as (sl:double, sw:double, pl:double, pw:double, species:chararray);
by_species = group iris by species;
result = foreach by_species generate group as species, $FUN(iris.$TAB);
dump result;
EOF
pig -x tez "$TMPFILE"
You can call it e.g. as
bash ./test.sh MAX sw
to get the maximum of column "sw", or
bash ./test.sh AVG sl
to get the average of column "sl"
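As a possible alternative to generating the script from bash, Pig has its own parameter substitution mechanism (the -param command line option), which lets you keep a single static .pig file. The sketch below reuses the iris example; the file names are illustrative.

```shell
#!/bin/bash
# Sketch: write a static Pig script once; $FUN and $TAB are left for
# Pig's parameter substitution (-param) rather than expanded by bash.
# The quoted 'EOF' prevents bash from expanding $FUN and $TAB here.
cat <<'EOF' > /tmp/agg.pig
iris = load '/tmp/iris.data' using PigStorage(',') as (sl:double, sw:double, pl:double, pw:double, species:chararray);
by_species = group iris by species;
result = foreach by_species generate group as species, $FUN(iris.$TAB);
dump result;
EOF
# Then run it, e.g.:
#   pig -x tez -param FUN=AVG -param TAB=sl /tmp/agg.pig
```

This keeps the Pig logic in one place and moves only the variable parts to the command line.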
Created 05-09-2016 10:13 AM
So, what's wrong? Do ParquetLoader and ParquetStorer work?
Created 05-09-2016 12:40 PM
The difficulty is that I would like to pass a generic function (AVG, SUM, ...) as a parameter, so that when running with Oozie the user can specify the operation (AVG, SUM, ...) to perform on the column, which would also be specified as a parameter.
Basically, what I want is to implement a transformation function that will be called whenever the user needs it. This function would take as parameters the table to be processed, the columns, and the transformation operator (AVG, SUM, ...).
Thank you !
Created 05-09-2016 04:05 PM
Thank you, Bernhard Walter, for your suggestions.
But I have a few more questions for you, please:
1. Since I need to be able to load several files one after another, and their schemas will be completely different, how can I handle the columns in the 'as' clause?
2. Can I replace TAB=$2 with TAB=$* in order to pass several columns as parameters?
Thank you !
Created 05-09-2016 04:36 PM
Well, this now gets into bash programming. The part between the lines "cat <<EOF" and "EOF" is a so-called "here document" that writes the actual pig script. Everything starting with $ is a variable: $0, $1, ... are predefined in bash, with $0 containing the script/program name, $1 the first actual parameter, $2 the second, and so on. $* expands to all provided parameters joined with a space by default, while "$@" expands to all parameters as separate words. Note: setting variables (e.g. FUN, TAB) is done without $, referencing with $.
So you can add any logic before "cat << EOF" to set variables leveraging your input parameters and reference them in the "here doc" to get the pig script you want.
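For example, a minimal sketch of a multi-column variant of test.sh (a hypothetical extension, names are illustrative): take the function as the first argument, treat all remaining arguments as column names, and build the generate clause in a loop before the here doc.

```shell
#!/bin/bash
# Sketch: first argument is the Pig function, remaining arguments
# are column names (hypothetical extension of test.sh).
build_pig_script() {
  local FUN=$1
  shift                      # "$@" now holds only the column names
  local AGGS="" COL
  for COL in "$@"; do        # build e.g. "AVG(iris.sl), AVG(iris.sw)"
    AGGS="${AGGS:+$AGGS, }$FUN(iris.$COL)"
  done
  cat <<EOF > /tmp/multi.pig
iris = load '/tmp/iris.data' using PigStorage(',') as (sl:double, sw:double, pl:double, pw:double, species:chararray);
by_species = group iris by species;
result = foreach by_species generate group as species, $AGGS;
dump result;
EOF
}

build_pig_script AVG sl sw   # writes /tmp/multi.pig
# pig -x tez /tmp/multi.pig  # run it as before
```

The loop is the "logic before the here doc" mentioned above; the generated script ends up with a plain comma-separated list of aggregates.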
For more see e.g. http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html and http://www.tldp.org/LDP/abs/html/index.html (and at many other locations).