Where do I get references for PiggyBank?

Explorer

Hi,

I was working through the practice exam and encountered a question where I had to store a Pig result in multiple files. This was a surprise, as I could not find any reference for this kind of storage in the Pig documentation. Initially, I thought of using SPLIT, which comes very close to multiple-file storage. But when I googled it, I came across a function, 'MultiStorage()', which would serve the purpose.

So, where can I find such functions and their usage? Can you please point me to whatever piggybank reference documentation exists so far?

Regards

Saurabh

1 ACCEPTED SOLUTION

Guru

Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the intent of the practice exam and the real exam, using PARALLEL is all you need to accomplish this task.


REPLIES

Master Guru

I had that problem before and didn't find any great websites on it. However, the source code of the piggybank functions contains some really good documentation in the Javadocs:

https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html

Or go directly to the source code; many of the functions are pretty straightforward to understand from the code:

http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank...

I didn't find anything better, but that doesn't mean it doesn't exist.

Master Mentor

A similar question has been asked before: https://community.hortonworks.com/questions/20487/store-output-file-as-3-files-using-pig.html

I am going to repeat my findings on top of @Rich Raposa's answer. They are only relevant if this is not for the purposes of the exam. This question was bothering me and I needed to try it out.

Here's a full script. piggybank.jar is in both pig-client/lib and the pig-client directory.

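-- Register piggybank so that MultiStorage is available
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
A = LOAD 'data2' USING PigStorage() as (url, count);
-- Remove the output directory left over from a previous run
fs -rm -R output;
-- Split the output on field 0: one subdirectory per distinct value
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');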

The dataset is:

1
2
3
4
5

The output is:

-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/1/1-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/2/2-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/3/3-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/4/4-0,000
Found 1 items
-rw-r--r--   3 root hdfs          3 2016-03-18 01:51 /user/root/output/5/5-0,000
-rw-r--r--   3 root hdfs          0 2016-03-18 01:51 /user/root/output/_SUCCESS


Each file has one line:

[root@sandbox ~]# hdfs dfs -cat /user/root/output/5/5-0,000
5

In the case of @Rich Raposa's example, the output directory would look like this:

[root@sandbox ~]# hdfs dfs -ls output3
Found 6 items
-rw-r--r--   3 root hdfs          0 2016-03-18 01:59 output3/_SUCCESS
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00000
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00001
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00002
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00003
-rw-r--r--   3 root hdfs          3 2016-03-18 01:59 output3/part-v003-o000-r-00004

This means that PARALLEL creates multiple files within the same directory, whereas MultiStorage creates a separate directory, with its own file, for each value of the split field. Additionally, with MultiStorage you can pass a compression option (bz2 or gz, but not snappy) and a field delimiter. It's clunky and the documentation is not the best, but if you need that type of control, it's an option.
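To illustrate the compression and delimiter options mentioned above, here is a minimal sketch based on the four-argument form of the constructor described in the MultiStorage Javadocs. The input file 'data2_csv', the output path 'output_gz' and the field names are made up for illustration, and I have not run this exact script, so treat it as a starting point rather than a verified recipe.

```pig
-- Sketch only: 'data2_csv' and 'output_gz' are hypothetical paths.
REGISTER /usr/hdp/current/pig-client/piggybank.jar;

-- Assumes a comma-separated input with two fields.
A = LOAD 'data2_csv' USING PigStorage(',') AS (url:chararray, hits:int);

-- Arguments, in order: parent output path, index of the field to split on,
-- compression ('bz2', 'gz' or 'none'), and the output field delimiter.
STORE A INTO 'output_gz'
    USING org.apache.pig.piggybank.storage.MultiStorage('output_gz', '0', 'gz', ',');
```

As in the uncompressed run, you should expect one subdirectory per distinct value of field 0, this time holding gzip-compressed files.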

Explorer

@Rich Raposa,

I'm not sure how to use PARALLEL with the STORE command. I see a PARALLEL option for GROUP, COGROUP, CROSS, DISTINCT, etc., but I did not find one for STORE. Could you please help me with an example? My exam is this Monday and an example would be of great help at this point.

Thanks

Saurabh

Guru

Sure - the following simple script uses 3 reducers on the last operation, so there will be 3 output files:

a = load 'something';
b = order a by $1 parallel 3;
store b into 'somewhere';

PARALLEL is not an option on STORE, but it is an option on a lot of other Pig operations.
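For example, the same idea works with GROUP, and Pig also lets you set a script-wide default with default_parallel instead of repeating the clause on each operator. The sketch below is illustrative only: 'something' and 'somewhere' are placeholder paths as in the script above, and the grouping field is an assumption.

```pig
-- Sketch only: placeholder paths, grouping on the first field.
-- Default number of reducers for every reduce-side operator in this script.
SET default_parallel 3;

a = LOAD 'something';
b = GROUP a BY $0;                      -- uses the default of 3 reducers
c = FOREACH b GENERATE group, COUNT(a);
STORE c INTO 'somewhere';               -- expect 3 part files, one per reducer
```

A PARALLEL clause on an individual operator, such as GROUP a BY $0 PARALLEL 5, overrides the default for that operator. Note that this only affects operators that trigger a reduce phase; a map-only script will not pick up the setting.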