Created 03-17-2018 03:21 PM
Hi Friends,
I was practicing tasks on AWS for the HDPCD exam. There is one question for which I need help; I am describing it briefly.
"From a Pig Script, Store the output as 3 Comma Separated files in HDFS directory"
Now, I used the command below for that.
STORE output INTO '<hdfs directory>' USING PigStorage(',') PARALLEL 3;
It ran 3 reducers, but it eventually stored only a single part-r-00000 file, containing all the rows, in the output HDFS path.
So, what is the simplest way to store 3 comma-separated output files from Pig? [ without using any additional jar file ]
Thanking you
Santanu
Created 03-17-2018 04:42 PM
You have to use PARALLEL with an operator that starts a reduce phase, such as GROUP, JOIN, CROSS, DISTINCT, etc.
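For a quick contrast with the STORE statement in the question: attaching PARALLEL to a reduce-phase operator is what controls the number of reducers, and hence the number of part files. A minimal sketch, with placeholder relation and column names:
-- honored: GROUP starts a reduce phase, so 3 reducers write 3 part files
grouped = GROUP data BY some_column PARALLEL 3;
STORE grouped INTO '<hdfs directory>' USING PigStorage(',');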
Below is an example of PARALLEL usage with a sample data set.
1) Put the data.csv into hdfs
[qa@vnode-68 root]$ hdfs dfs -put data.csv /user/qa/
2) Check the content of the data file
[qa@vnode-68 root]$ hdfs dfs -cat /user/qa/data.csv
abhi,34,brown,5
john,35,green,6
amy,30,brown,6
Steve,38,blue,6
Brett,35,brown,6
Andy,34,brown,6
3) Run the pig script which will group users by color and dump the output into hdfs
[qa@vnode-68 root]$ pig
grunt> data = LOAD '/user/qa/data.csv' using PigStorage(',') as (name:chararray, age:int, color:chararray, height:int);
grunt> b = group data by color parallel 3;
grunt> store b into '/user/qa/new' using PigStorage(',');
grunt> quit
4) Check the output folder to make sure that 3 files are created
[qa@vnode-68 root]$ hdfs dfs -ls /user/qa/new
Found 4 items
-rw-r--r--   3 qa hdfs    0 2018-03-17 16:28 /user/qa/new/_SUCCESS
-rw-r--r--   3 qa hdfs    0 2018-03-17 16:28 /user/qa/new/part-r-00000
-rw-r--r--   3 qa hdfs   80 2018-03-17 16:28 /user/qa/new/part-r-00001
-rw-r--r--   3 qa hdfs   51 2018-03-17 16:28 /user/qa/new/part-r-00002
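If you want to spot-check what went into each file, you can also cat the part files (output omitted here):
[qa@vnode-68 root]$ hdfs dfs -cat /user/qa/new/part-r-*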
Additional reference : https://pig.apache.org/docs/r0.15.0/perf.html#parallel
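Pig also supports setting a script-wide default number of reducers with SET default_parallel, which applies to every reduce-phase operator in the script. A minimal sketch reusing the same sample data (the output path is a placeholder):
grunt> SET default_parallel 3;
grunt> data = LOAD '/user/qa/data.csv' using PigStorage(',') as (name:chararray, age:int, color:chararray, height:int);
grunt> b = group data by color;
grunt> store b into '<hdfs directory>' using PigStorage(',');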
If this helps you, please click on the Accept button to accept the answer. This will be really useful for other community users.
-Aditya
Created 03-18-2018 05:27 AM
Thanks @Aditya Sirna for your response. It's working now. I used this command:
relation_1 = ORDER relation_0 BY <col_2> DESC PARALLEL 3;
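For anyone following along, a full script along these lines might look like the sketch below (relation names, column names, and paths are placeholders, not the actual exam values):
relation_0 = LOAD '<hdfs input path>' USING PigStorage(',') AS (col_1:chararray, col_2:int);
relation_1 = ORDER relation_0 BY col_2 DESC PARALLEL 3;
STORE relation_1 INTO '<hdfs output directory>' USING PigStorage(',');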
Thanking you
Santanu