Support Questions

get2noesks · ‎03-11-2016

Hi,

I was working through the practice exam and hit a question where I had to store a Pig result in multiple files. It came as a surprise, as I could not find any reference to this kind of storage in the Pig documentation. Initially I thought of using SPLIT, which comes very close to multi-file storage. But when I googled it, I came across a function, 'MultiStorage()', that serves the purpose.

So, where can I find such methods/functions and their usage? Can you please point me to whatever piggybank references have been documented so far?

Regards

Saurabh

rich1 · ‎03-11-2016

Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the purposes of the practice exam and the real exam, using PARALLEL is all you need to accomplish this task.

bleonhardi · ‎03-11-2016

I had that problem before and didn't find any great websites covering it. However, the source code of the piggybank functions contains some really good documentation in the javadocs.

https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html

Or go directly to the source code; many of the functions are pretty straightforward to understand from the code:

http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank...

I didn't find anything better, but that doesn't mean it doesn't exist.

aervits · ‎03-11-2016

A similar question has been asked before: https://community.hortonworks.com/questions/20487/store-output-file-as-3-files-using-pig.html

I am going to repeat my findings here, in addition to @Rich Raposa's answer. They are only relevant if this is not for the purposes of the exam. This question was bothering me and I needed to try it out.

Here's a full script. The piggybank jar is in both pig-client/lib and the pig-client directory.

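-- Register piggybank so that MultiStorage is available
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
A = LOAD 'data2' USING PigStorage() as (url, count);
-- Remove any previous output directory before storing
fs -rm -R output;
-- Split the output on field 0 (url): one subdirectory per distinct value
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');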

dataset is

output is

-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/1/1-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/2/2-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/3/3-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/4/4-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/5/5-0,000
-rw-r--r--   3 root hdfs          0 2016-03-18 01:51 /user/root/output/_SUCCESS

each file has one line

[root@sandbox ~]# hdfs dfs -cat /user/root/output/5/5-0,000
5

With @Rich Raposa's PARALLEL example, the output directory would look like this:

[root@sandbox ~]# hdfs dfs -ls output3
Found 6 items
-rw-r--r--   3 root hdfs          0 2016-03-18 01:59 output3/_SUCCESS
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00000
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00001
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00002
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00003
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00004

This means that PARALLEL creates multiple files within the same directory, whereas MultiStorage creates a separate directory and file for each value of the split field. Additionally, with MultiStorage you can pass a compression option (bz2 or gz, but not snappy) and a field delimiter. It's clunky and the documentation is not the best, but if you need that type of control, it's an option.
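For reference, here is a minimal sketch of how the compression and delimiter arguments might be passed, based on the four-argument form shown in the MultiStorage javadoc. I have not run this variant; the 'output_gz' path is hypothetical, and the choice of 'gz' and a tab delimiter is only an illustration.

```pig
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
A = LOAD 'data2' USING PigStorage() AS (url, count);
-- 'output_gz' is a hypothetical path used for illustration only
-- Arguments: parent path, index of the split field, compression, field delimiter
STORE A INTO 'output_gz' USING org.apache.pig.piggybank.storage.MultiStorage('output_gz', '0', 'gz', '\\t');
```

As in the two-argument script above, the first argument repeats the STORE path and the second selects the field whose values name the subdirectories.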

get2noesks · ‎03-11-2016

@Rich Raposa,

I'm not sure how to use PARALLEL with the STORE command. I see the PARALLEL option for GROUP, COGROUP, CROSS, DISTINCT, etc., but I did not find it for STORE. Could you please help me with an example? My exam is this Monday, and an example would be a great help at this point.

Thanks

Saurabh

rich1 · ‎03-11-2016

Sure - the following simple script uses 3 reducers on the last operation, so there will be 3 output files:

a = load 'something';
b = order a by $1 parallel 3;
store b into 'somewhere';

PARALLEL is not an option on STORE, but it is an option on a lot of other Pig operations.
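As a further sketch of the same idea with a different operator, the script below puts PARALLEL on a GROUP. The input path, output path, schema, and relation names are all hypothetical. The FOREACH should run in the same reduce phase as the GROUP, so the expectation is one part file per reducer, though some of those files may be empty if there are fewer distinct keys than reducers.

```pig
-- Hypothetical input path and schema, for illustration only
logs = LOAD 'input_data' USING PigStorage(',') AS (url:chararray, hits:int);
-- PARALLEL on GROUP requests 4 reducers for this reduce-side operation
by_url = GROUP logs BY url PARALLEL 4;
totals = FOREACH by_url GENERATE group AS url, SUM(logs.hits) AS total_hits;
-- STORE takes no PARALLEL clause; it writes whatever the preceding reducers produce
STORE totals INTO 'totals_out';
```

If every reduce-side operation in a script should use the same number of reducers, a `SET default_parallel 4;` line at the top of the script is an alternative to repeating PARALLEL on each statement.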

Where do I get references for PiggyBank.