Created 03-17-2018 03:21 PM
Hi Friends,
I was practicing tasks on AWS for the HDPCD exam. There is one question for which I need help; I am describing it briefly.
"From a Pig Script, Store the output as 3 Comma Separated files in HDFS directory"
Now, I used the command below for that.
STORE output INTO '<hdfs directory>' USING PigStorage(',') PARALLEL 3;
It ran 3 reducers, but it eventually stored only a single part-r-00000 file, containing all the rows, in the output HDFS path.
So, what is the simplest way to store 3 comma-separated output files from Pig? [ without using any additional jar file ]
Thanking you
Santanu
Created 03-17-2018 04:42 PM
You have to use PARALLEL with an operator that starts a reduce phase, such as GROUP, JOIN, CROSS, DISTINCT, etc.
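For a quick contrast with the STORE statement in the question: attaching PARALLEL to a reduce-phase operator is what controls the number of reducers, and hence the number of part files. A minimal sketch, with placeholder relation and column names:
-- honored: GROUP starts a reduce phase, so 3 reducers write 3 part files
grouped = GROUP data BY some_column PARALLEL 3;
STORE grouped INTO '<hdfs directory>' USING PigStorage(',');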
Below is an example of PARALLEL usage with a sample data set.
1) Put the data.csv into hdfs
[qa@vnode-68 root]$ hdfs dfs -put data.csv /user/qa/
2) Check the content of the data file
[qa@vnode-68 root]$ hdfs dfs -cat /user/qa/data.csv
abhi,34,brown,5
john,35,green,6
amy,30,brown,6
Steve,38,blue,6
Brett,35,brown,6
Andy,34,brown,6
3) Run the pig script which will group users by color and dump the output into hdfs
[qa@vnode-68 root]$ pig
grunt> data = LOAD '/user/qa/data.csv' using PigStorage(',') as (name:chararray, age:int, color:chararray, height:int);
grunt> b = group data by color parallel 3;
grunt> store b into '/user/qa/new' using PigStorage(',');
grunt> quit
4) Check the output folder to make sure that 3 files are created
[qa@vnode-68 root]$ hdfs dfs -ls /user/qa/new
Found 4 items
-rw-r--r--   3 qa hdfs    0 2018-03-17 16:28 /user/qa/new/_SUCCESS
-rw-r--r--   3 qa hdfs    0 2018-03-17 16:28 /user/qa/new/part-r-00000
-rw-r--r--   3 qa hdfs   80 2018-03-17 16:28 /user/qa/new/part-r-00001
-rw-r--r--   3 qa hdfs   51 2018-03-17 16:28 /user/qa/new/part-r-00002
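If you want to spot-check what went into each file, you can also cat the part files (output omitted here):
[qa@vnode-68 root]$ hdfs dfs -cat /user/qa/new/part-r-*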
Additional reference : https://pig.apache.org/docs/r0.15.0/perf.html#parallel
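Pig also supports setting a script-wide default number of reducers with SET default_parallel, which applies to every reduce-phase operator in the script. A minimal sketch reusing the same sample data (the output path is a placeholder):
grunt> SET default_parallel 3;
grunt> data = LOAD '/user/qa/data.csv' using PigStorage(',') as (name:chararray, age:int, color:chararray, height:int);
grunt> b = group data by color;
grunt> store b into '<hdfs directory>' using PigStorage(',');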
If this helps you, please click on the Accept button to accept the answer. This will be really useful for other community users.
-Aditya
Created 03-18-2018 05:27 AM
Thanks @Aditya Sirna for your response. It's working now. I used this command:
relation_1 = ORDER relation_0 BY <col_2> DESC PARALLEL 3;
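For anyone following along, a full script along these lines might look like the sketch below (relation names, column names, and paths are placeholders, not the actual exam values):
relation_0 = LOAD '<hdfs input path>' USING PigStorage(',') AS (col_1:chararray, col_2:int);
relation_1 = ORDER relation_0 BY col_2 DESC PARALLEL 3;
STORE relation_1 INTO '<hdfs output directory>' USING PigStorage(',');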
Thanking you
Santanu