Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

store multiple files using pig

Contributor

Hi Friends,

I was practicing the AWS tasks for the HDPCD exam. There is one question I need help with, which I describe briefly below.

"From a Pig Script, Store the output as 3 Comma Separated files in HDFS directory"

I used the command below for that.

STORE output INTO '<hdfs directory>' USING PigStorage(',') PARALLEL 3;

It ran 3 reducers, but in the end it stored only a single part-r-00000 file, containing all rows, in the output HDFS path.

So, what is the simplest way to store the output as 3 comma-separated files from Pig (without using any additional jar file)?

Thanking you

Santanu

1 ACCEPTED SOLUTION

Super Guru

@Santanu Ghosh,

PARALLEL only takes effect on an operator that starts a reduce phase, such as GROUP, JOIN, CROSS, DISTINCT or ORDER BY, so you have to attach it to one of those.

Here is how PARALLEL is used, with an example data set:

1) Put data.csv into HDFS

[qa@vnode-68 root]$ hdfs dfs -put data.csv /user/qa/

2) Check the contents of the data file

[qa@vnode-68 root]$ hdfs dfs -cat /user/qa/data.csv
abhi,34,brown,5
john,35,green,6
amy,30,brown,6
Steve,38,blue,6
Brett,35,brown,6
Andy,34,brown,6

3) Run the Pig script, which groups users by color and stores the output in HDFS

[qa@vnode-68 root]$ pig
grunt> data = LOAD '/user/qa/data.csv' using PigStorage(',') as (name:chararray,age:int, color:chararray,height:int);
grunt> b = group data by color parallel 3;
grunt> store b into '/user/qa/new' using PigStorage(',');
grunt> quit

4) Check the output folder to confirm that 3 part files were created

[qa@vnode-68 root]$ hdfs dfs -ls /user/qa/new
Found 4 items
-rw-r--r--   3 qa hdfs          0 2018-03-17 16:28 /user/qa/new/_SUCCESS
-rw-r--r--   3 qa hdfs          0 2018-03-17 16:28 /user/qa/new/part-r-00000
-rw-r--r--   3 qa hdfs         80 2018-03-17 16:28 /user/qa/new/part-r-00001
-rw-r--r--   3 qa hdfs         51 2018-03-17 16:28 /user/qa/new/part-r-00002
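
The check in step 4 can also be done without reading the listing by eye. The following is a minimal sketch, not part of the original answer: the two helper function names are hypothetical, and it assumes the usual `hdfs dfs -ls` column layout shown above, with the file size in the fifth column and the path in the last. The second helper is there because a part file can exist and still be empty. In the listing above, part-r-00000 is 0 bytes, which is consistent with the three color keys landing on only two of the three reducers.

```shell
#!/bin/sh
# Count reducer output files (part-r-NNNNN) in an `hdfs dfs -ls`
# listing read from stdin. The function names are hypothetical.
count_part_files() {
    grep -c '/part-r-[0-9]*$'
}

# Count only the part files that contain data, assuming the size is
# the fifth column and the path is the last column of each line.
count_nonempty_part_files() {
    awk '$NF ~ /\/part-r-[0-9]+$/ && $5 > 0 { n++ } END { print n + 0 }'
}

# Usage on a cluster (output path taken from the example above):
#   hdfs dfs -ls /user/qa/new | count_part_files
#   hdfs dfs -ls /user/qa/new | count_nonempty_part_files
```

For the listing shown in step 4, the first helper reports 3 files and the second reports 2.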

Additional reference: https://pig.apache.org/docs/r0.15.0/perf.html#parallel

If this helps you, please click on the Accept button to accept the answer. This will be really useful for other community users.

-Aditya

2 REPLIES

Contributor

Thanks @Aditya Sirna for your response. It works. I used this command:

relation_1 = ORDER relation_0 BY <col_2> DESC PARALLEL 3;

Thanking you

Santanu
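
For readers who want the whole flow in one place, here is a minimal sketch of the ORDER ... PARALLEL approach applied to the example data set from the accepted answer. It is an illustration under stated assumptions, not the exact script used in the thread: the script file name and the output directory are hypothetical, and sorting by age is an arbitrary choice because the original post does not name the sort column. Unlike GROUP, ORDER keeps one flat row per input record, so each part file holds plain comma-separated rows.

```shell
#!/bin/sh
# Write a small Pig script to a file. The file name and the output
# directory are hypothetical; the input path and schema come from the
# example data set in the accepted answer.
cat > store_three_files.pig <<'EOF'
-- Load the comma-separated input.
data = LOAD '/user/qa/data.csv' USING PigStorage(',')
       AS (name:chararray, age:int, color:chararray, height:int);

-- ORDER BY starts a reduce phase, so PARALLEL 3 requests 3 reducers.
-- Sorting by age is an arbitrary choice made for this sketch.
sorted = ORDER data BY age DESC PARALLEL 3;

-- Each reducer writes its own part-r-NNNNN file of flat rows.
STORE sorted INTO '/user/qa/sorted_out' USING PigStorage(',');
EOF

# On a cluster, run the script with:
#   pig -f store_three_files.pig
```

With an input as small as this one, some of the three part files may be empty, but three files should still be written.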