Created 03-23-2016 04:34 PM
If I use a Pig script like the one below, I can leverage MapReduce to compress a ton of data, and I get a pretty good compression ratio.
However, when I try to decompress the data, I lose the individual files.
For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.
That's fine.
But then, when I try to do the same thing in reverse to get back my original content, I just get larger files that look like part-m-00001, part-m-00002, etc.
Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files - including the file name?
Thanks!
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();
STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();
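(For reference, the reverse direction is just the same script pointed the other way, without the compression settings; the output path below is only an example.)
InputFiles = LOAD '/my/hdfs/path_compressed/' USING PigStorage();
STORE InputFiles INTO '/my/hdfs/path_uncompressed/' USING PigStorage();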
Created 03-23-2016 04:39 PM
Not easily. MapReduce by design groups files together as it pleases and then writes one output file per mapper/reducer. Those are the part files.
Pig will not accommodate what you want; the whole stack is designed to put an abstraction layer over the data files it reads.
What you could do is use something like Hadoop Streaming, or write your own InputFormat that somehow forwards the data to the reducers. However, that will not be straightforward.
https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F
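Roughly, the pattern from that FAQ is a map-only streaming job driven by a list of file names, where each mapper compresses exactly one file and keeps its name. An untested sketch; the streaming jar path, filelist.txt, compress_one.sh and the output paths are just examples, adjust for your cluster:

# 1. Build a list of the original files, one full HDFS path per line, and put it in HDFS.
hdfs dfs -ls /my/hdfs/path/ | awk '/\.dat$/ {print $NF}' > filelist.txt
hdfs dfs -put filelist.txt /tmp/filelist.txt

# 2. Map-only streaming job. NLineInputFormat hands each mapper one line (one file name);
#    the job's own part-* output can be ignored, the real work is the per-file put below.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input /tmp/filelist.txt \
  -output /tmp/compress_job_out \
  -mapper "bash compress_one.sh" \
  -file compress_one.sh

compress_one.sh, which each mapper runs (each input line arrives as a byte offset, a tab, and one file path):

#!/bin/bash
while read -r offset path; do
  name=$(basename "$path")
  # Stream one file through bzip2 and write it back under its original name plus .bz2
  hdfs dfs -cat "$path" | bzip2 -c | hdfs dfs -put - "/my/hdfs/path_compressed/${name}.bz2"
done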
So, short answer: while possible, it is not easy. Sorry.
Created 03-23-2016 08:14 PM
Thanks Benjamin
Created 03-24-2016 11:51 AM
If you want to investigate this more, there is a Hadoop Streaming example in the Hadoop: The Definitive Guide book, which might be of help. (They get a list of files, then spin off Reducers based on the files and run some Linux commands in the Reducers. You could essentially do anything you want.)
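The same pattern also covers the way back: if each original file was compressed to its own name (a.dat -> a.dat.bz2, as in the sketch above), a per-file mapper restores the original names. A minimal counterpart to the earlier compress_one.sh (script name and paths are again only illustrative):

#!/bin/bash
# decompress_one.sh -- each input line is a byte offset, a tab, and one .bz2 file path
while read -r offset path; do
  name=$(basename "$path" .bz2)
  hdfs dfs -cat "$path" | bzip2 -dc | hdfs dfs -put - "/my/hdfs/path_restored/$name"
done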