
store multiple files using pig

Santanu — Sat, 17 Mar 2018 22:21:10 GMT

Hi Friends,

I was practicing the AWS tasks for the HDPCD exam. There is one question I need help with, which I describe briefly below.

"From a Pig Script, Store the output as 3 Comma Separated files in HDFS directory"

I used the command below for that.

STORE output INTO '<hdfs directory>' USING PigStorage(',') PARALLEL 3;

It ran 3 reducers, but in the end it stored only one part-r-00000 file, containing all rows, in the output HDFS path.

So, what is the simplest way to store the output as 3 comma-separated files from Pig (without using any additional jar file)?

Thanking you

Santanu

Re: store multiple files using pig

asirna — Sat, 17 Mar 2018 23:42:31 GMT

@Santanu Ghosh,

PARALLEL only takes effect on an operator that starts a reduce phase, such as GROUP, JOIN, CROSS, DISTINCT, etc. STORE does not start one, so put the clause on the reduce-side operator that precedes it.

The usage of PARALLEL is shown below with an example data set.

1) Put data.csv into HDFS

[qa@vnode-68 root]$ hdfs dfs -put data.csv /user/qa/

2) Check the content of the data file

[qa@vnode-68 root]$ hdfs dfs -cat /user/qa/data.csv
abhi,34,brown,5
john,35,green,6
amy,30,brown,6
Steve,38,blue,6
Brett,35,brown,6
Andy,34,brown,6

3) Run the Pig script, which groups users by color and stores the output in HDFS

[qa@vnode-68 root]$ pig
grunt> data = LOAD '/user/qa/data.csv' using PigStorage(',') as (name:chararray,age:int, color:chararray,height:int);
grunt> b = group data by color parallel 3;
grunt> store b into '/user/qa/new' using PigStorage(',');
grunt> quit

4) Check the output folder to confirm that 3 part files were created

[qa@vnode-68 root]$ hdfs dfs -ls /user/qa/new
Found 4 items
-rw-r--r--   3 qa hdfs          0 2018-03-17 16:28 /user/qa/new/_SUCCESS
-rw-r--r--   3 qa hdfs          0 2018-03-17 16:28 /user/qa/new/part-r-00000
-rw-r--r--   3 qa hdfs         80 2018-03-17 16:28 /user/qa/new/part-r-00001
-rw-r--r--   3 qa hdfs         51 2018-03-17 16:28 /user/qa/new/part-r-00002

Additional reference: https://pig.apache.org/docs/r0.15.0/perf.html#parallel
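
In the listing above, part-r-00000 is 0 bytes: each of the 3 reducers writes its own part file, but with only three distinct colors a reducer can end up with no keys. As a possible alternative to putting PARALLEL on each operator, the reducer count can be set once for the whole script with default_parallel. This is only a sketch, assuming the same data.csv and schema as above; the output path /user/qa/new_default is a hypothetical name chosen for illustration.

```pig
-- Sketch only: set the reducer count once for every reduce-phase operator
-- in the script instead of adding PARALLEL to each one.
SET default_parallel 3;

-- Same input file and schema as in the example above.
data = LOAD '/user/qa/data.csv' USING PigStorage(',')
       AS (name:chararray, age:int, color:chararray, height:int);

-- GROUP starts a reduce phase, so it should run with 3 reducers.
b = GROUP data BY color;

-- '/user/qa/new_default' is a hypothetical output directory; it must not exist yet.
STORE b INTO '/user/qa/new_default' USING PigStorage(',');
```

A PARALLEL clause on an individual operator still overrides this default for that operator.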

If this helps you, please click on the Accept button to accept the answer. This will be really useful for other community users.

-Aditya

Re: store multiple files using pig

Santanu — Sun, 18 Mar 2018 12:27:02 GMT

Thanks @Aditya Sirna for your response. It's working. I used this command.

relation_1 = ORDER relation_0 BY <col_2> DESC PARALLEL 3;
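
For anyone reading later, a fuller sketch of this approach. It assumes the sample /user/qa/data.csv and schema from the reply above stand in for relation_0, sorts on the age column as an example, and writes to a hypothetical output path /user/qa/sorted_out. ORDER BY starts a reduce phase, so PARALLEL 3 should produce three part-r-* files of flat comma-separated rows, although with very small inputs some of them may be empty.

```pig
-- Sketch only: input path and schema are taken from the earlier reply,
-- the sort column and output path are assumptions for illustration.
relation_0 = LOAD '/user/qa/data.csv' USING PigStorage(',')
             AS (name:chararray, age:int, color:chararray, height:int);

-- ORDER BY triggers a reduce phase; PARALLEL 3 requests 3 reducers,
-- and each reducer writes its own part-r-* file.
relation_1 = ORDER relation_0 BY age DESC PARALLEL 3;

-- Rows stay flat (no nested bags), so each part file is plain CSV.
-- '/user/qa/sorted_out' is a hypothetical directory; it must not exist yet.
STORE relation_1 INTO '/user/qa/sorted_out' USING PigStorage(',');
```

Unlike GROUP, ORDER BY keeps one record per line, which matches the "comma separated files" wording of the exam task.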

Thanking you

Santanu