<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Options for decompressing HDFS data using Pig in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141134#M23606</link>
    <description>&lt;P&gt;If I use a Pig script like the one below, I am able to leverage MapReduce to compress a ton of data, and I get a pretty good ratio.&lt;/P&gt;&lt;P&gt;However, when I try to decompress the data, I lose the individual files.&lt;/P&gt;&lt;P&gt;For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.&lt;/P&gt;&lt;P&gt;That's fine.&lt;/P&gt;&lt;P&gt;But then, when I try to do the same thing in reverse to get back my original content, I just get larger files named part-m-00001, part-m-00002, etc.&lt;/P&gt;&lt;P&gt;Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files - including the file names?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.enabled true;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 23 Mar 2016 23:34:59 GMT</pubDate>
    <dc:creator>zack_riesland</dc:creator>
    <dc:date>2016-03-23T23:34:59Z</dc:date>
    <item>
      <title>Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141134#M23606</link>
      <description>&lt;P&gt;If I use a Pig script like the one below, I am able to leverage MapReduce to compress a ton of data, and I get a pretty good ratio.&lt;/P&gt;&lt;P&gt;However, when I try to decompress the data, I lose the individual files.&lt;/P&gt;&lt;P&gt;For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.&lt;/P&gt;&lt;P&gt;That's fine.&lt;/P&gt;&lt;P&gt;But then, when I try to do the same thing in reverse to get back my original content, I just get larger files named part-m-00001, part-m-00002, etc.&lt;/P&gt;&lt;P&gt;Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files - including the file names?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.enabled true;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Mar 2016 23:34:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141134#M23606</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2016-03-23T23:34:59Z</dc:date>
    </item>
    <item>
      <title>Re: Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141135#M23607</link>
      <description>&lt;P&gt;Not easily. MapReduce, by design, groups files together as it pleases and then writes one output file per mapper/reducer. Those are the part files.&lt;/P&gt;&lt;P&gt;Pig will not accommodate what you want; the whole stack is designed to put an abstraction layer over the data files it reads.&lt;/P&gt;&lt;P&gt;What you could do is use something like Hadoop streaming, or write your own InputFormat that somehow forwards the data to the reducers. However, that will not be straightforward.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F" target="_blank"&gt;https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F&lt;/A&gt;&lt;/P&gt;&lt;P&gt;So, short answer: while possible, it is not easy. Sorry.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Mar 2016 23:39:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141135#M23607</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-03-23T23:39:54Z</dc:date>
    </item>
    <item>
      <title>Re: Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141136#M23608</link>
      <description>&lt;P&gt;Thanks Benjamin&lt;/P&gt;</description>
      <pubDate>Thu, 24 Mar 2016 03:14:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141136#M23608</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2016-03-24T03:14:49Z</dc:date>
    </item>
    <item>
      <title>Re: Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141137#M23609</link>
      <description>&lt;P&gt;If you want to investigate this further, there is a Hadoop streaming example in the book "Hadoop: The Definitive Guide" that might be of help. (They get a list of files, then spin off reducers based on the files and run some Linux commands in the reducers. You could essentially do anything you want.)&lt;/P&gt;</description>
      <pubDate>Thu, 24 Mar 2016 18:51:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141137#M23609</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-03-24T18:51:03Z</dc:date>
    </item>
  </channel>
</rss>

