Created 03-11-2016 05:58 AM
Hi,
I was taking the practice exam and encountered a question where I had to store a Pig result in multiple files. The question surprised me because I could not find any reference to this kind of storage in the Pig documentation. Initially, I thought of using SPLIT, which seemed very close to multiple-file storage. But when I googled it, I came across a function, 'MultiStorage()', which would serve the purpose.
So, where can I find such functions and their usage? Could you please point me to whatever piggybank documentation exists so far?
Regards
Saurabh
Created 03-11-2016 12:09 PM
Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the intent of the practice exam and the real exam, using PARALLEL is all you need to accomplish this task.
Created 03-11-2016 09:21 AM
I had that problem before and didn't find any great websites on it. However, the source code of the piggybank functions contains some really good documentation in the Javadocs:
https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
Or go directly to the source code; many of the functions are pretty straightforward to understand from the code:
I didn't find anything better, but that doesn't mean it doesn't exist.
Created 03-11-2016 11:52 AM
A similar question has been asked before: https://community.hortonworks.com/questions/20487/store-output-file-as-3-files-using-pig.html
I am going to add my findings on top of @Rich Raposa's answer. They are only relevant if you are not doing this for the exam; the question was bothering me and I needed to try it out.
Here's a full script. piggybank.jar is in both pig-client/lib and the pig-client directory.
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
A = LOAD 'data2' USING PigStorage() as (url, count);
fs -rm -R output;
STORE A INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
The dataset is:
1
2
3
4
5
The output is:
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:51 /user/root/output/1/1-0,000
Found 1 items
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:51 /user/root/output/2/2-0,000
Found 1 items
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:51 /user/root/output/3/3-0,000
Found 1 items
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:51 /user/root/output/4/4-0,000
Found 1 items
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:51 /user/root/output/5/5-0,000
-rw-r--r-- 3 root hdfs 0 2016-03-18 01:51 /user/root/output/_SUCCESS
Each file has one line:
[root@sandbox ~]# hdfs dfs -cat /user/root/output/5/5-0,000
5
In the case of @Rich Raposa's example, the output directory would look like this:
[root@sandbox ~]# hdfs dfs -ls output3
Found 6 items
-rw-r--r-- 3 root hdfs 0 2016-03-18 01:59 output3/_SUCCESS
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:59 output3/part-v003-o000-r-00000
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:59 output3/part-v003-o000-r-00001
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:59 output3/part-v003-o000-r-00002
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:59 output3/part-v003-o000-r-00003
-rw-r--r-- 3 root hdfs 3 2016-03-18 01:59 output3/part-v003-o000-r-00004
This means that PARALLEL creates multiple files within the same directory, whereas MultiStorage creates a separate directory, with its own file, for each value of the split field. Additionally, MultiStorage lets you pass a compression codec (only bz2 or gz, not snappy) and a field delimiter. It's clunky and the documentation is not the best, but if you need that type of control, it's an option.
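For reference, here is a rough sketch of what passing compression and a delimiter to MultiStorage might look like. It follows the four-argument form shown in the piggybank Javadoc (parent path, split field index, compression, field delimiter). The output directory name 'output_gz' is made up for illustration, and I have not run this exact variant on the sandbox.

```pig
-- Sketch only: 'output_gz' is a hypothetical output path.
REGISTER /usr/hdp/current/pig-client/piggybank.jar;

A = LOAD 'data2' USING PigStorage() AS (url, count);

-- Arguments: parent path, index of the field to split on,
-- compression ('gz' here; 'bz2' is the other codec mentioned above),
-- and the field delimiter (escaped tab, as written in the Javadoc example).
STORE A INTO 'output_gz'
    USING org.apache.pig.piggybank.storage.MultiStorage('output_gz', '0', 'gz', '\\t');
```

As with the uncompressed script, each distinct value of field 0 should end up in its own subdirectory; check the resulting names with hdfs dfs -ls before relying on them.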
Created 03-11-2016 06:49 PM
@Rich Raposa,
I'm not sure how to use PARALLEL with the STORE command. I see a PARALLEL option for GROUP, COGROUP, CROSS, DISTINCT, etc., but I did not find one for STORE. Could you please help me with an example? My exam is this Monday, and an example would be of great help at this point.
Thanks
Saurabh
Created 03-11-2016 07:56 PM
Sure - the following simple script uses 3 reducers on the last operation, so there will be 3 output files:
a = load 'something';
b = order a by $1 parallel 3;
store b into 'somewhere';
PARALLEL is not an option on STORE, but it is an option on a lot of other Pig operations.
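To illustrate PARALLEL on another operator, here is a minimal sketch using GROUP, plus the script-wide default_parallel setting as an alternative. The input path, output paths, and the schema (key, value) are placeholders I made up for illustration, not anything from the exam.

```pig
-- Sketch only: paths and schema below are hypothetical.
a = LOAD 'something' AS (key:chararray, value:int);

-- GROUP triggers a reduce phase, so PARALLEL applies here:
-- 3 reducers, hence 3 part files in the output directory.
b = GROUP a BY key PARALLEL 3;
c = FOREACH b GENERATE group, COUNT(a);
STORE c INTO 'somewhere_grouped';

-- Alternative: set a default for every reduce-side operator in the
-- script instead of repeating PARALLEL on each statement.
SET default_parallel 3;
d = DISTINCT a;
STORE d INTO 'somewhere_distinct';
```

Either way, the number of output files follows the number of reducers of the last reduce-side operation before the STORE.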