Options for decompressing HDFS data using Pig

Super Collaborator

If I use a Pig script like the one shown below, I can leverage MapReduce to compress a large amount of data, and I get a pretty good compression ratio.

However, when I try to decompress the data, I lose the individual files.

For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.

That's fine.

But then, when I try to do the same thing in reverse, to get back my original content, I just get larger files that look like part-m-00001, part-m-00002, etc.

Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files - including the file name?

Thanks!

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;

InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();
STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();
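
Roughly, the reverse direction I mean is just the mirror image of that script (the paths below are placeholders, not my real ones); the .bz2 part files get decompressed automatically on load based on their extension:

set output.compression.enabled false;

-- LOAD transparently decompresses the .bz2 part files; STORE writes them back out uncompressed
InputFiles = LOAD '/my/hdfs/path_compressed/' USING PigStorage();
STORE InputFiles INTO '/my/hdfs/path_uncompressed/' USING PigStorage();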

1 ACCEPTED SOLUTION

Master Guru

Not easily. MapReduce by definition groups files together as it pleases and then writes one output file per mapper/reducer. Those are the part files.

Pig will not accommodate what you want; the whole stack is designed to put an abstraction layer over the data files it reads.

What you could do is use something like Hadoop Streaming, or write your own InputFormat that somehow forwards the data to the reducers. However, that will not be straightforward.

https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F
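
To give a rough idea of that per-file approach, here is an untested sketch: process each file in its own map task via a mapper script, so every original file name is preserved. The file list, paths, job options, and the script name (decompress.sh) are made up for illustration.

#!/bin/bash
# decompress.sh -- with NLineInputFormat each map task gets one line of the
# file list on stdin, in the form "<byte offset>\t<hdfs path of a .bz2 file>"
while read -r offset path; do
  name=$(basename "$path" .bz2)
  # stream the file out of HDFS, decompress it, and write it back under its original name
  hdfs dfs -cat "$path" | bzip2 -dc | hdfs dfs -put - "/my/hdfs/path_uncompressed/$name"
done

# submitted as a map-only streaming job over a text file listing one HDFS path per line:
# hadoop jar hadoop-streaming.jar \
#   -D mapreduce.job.reduces=0 \
#   -input /tmp/filelist.txt \
#   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
#   -output /tmp/decompress_job_output \
#   -mapper decompress.sh \
#   -file decompress.sh

The compression direction is the same pattern with bzip2 -c and the .bz2 suffix appended instead of stripped. Note that this only gives you back the original names if the files were compressed one per file in the first place, not for the part-m-* output of the Pig job.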

So, short answer: while possible, it is not easy. Sorry.


3 REPLIES


Super Collaborator

Thanks, Benjamin

Master Guru

If you want to investigate this more, there is a Hadoop Streaming example in the book Hadoop: The Definitive Guide, which might be of help. (They get a list of files, then spin up reducers based on those files and run Linux commands in the reducers. You could essentially do anything you want.)