Options for decompressing HDFS data using Pig
Labels: Apache Pig
Created 03-23-2016 04:34 PM
If I use a Pig script like the one below, I can leverage MapReduce to compress a ton of data and I get a pretty good ratio.
However, when I try to decompress the data, I lose the individual files.
For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.
That's fine.
But then, when I try to do the same thing in reverse, to get back my original content, I just get larger files that look like part-m-00001, part-m-00002, etc.
Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files, including the file names?
Thanks!
-- Enable compression of the job output and select the BZip2 codec
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
-- Read everything under the input path and write it back out compressed
InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();
STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();
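For reference, the reverse direction I run looks roughly like the sketch below (the output path is made up); PigStorage reads the .bz2 part files transparently, but the result is again anonymous part-* files:
-- Rough sketch of the decompression direction (hypothetical output path)
InputFiles = LOAD '/my/hdfs/path_compressed/' USING PigStorage();
STORE InputFiles INTO '/my/hdfs/path_uncompressed/' USING PigStorage();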
Created 03-23-2016 04:39 PM
Not easily. MapReduce by definition groups files together as it pleases and then writes one output file per mapper/reducer. Those are the part files.
Pig will not accommodate what you want; the whole stack is designed to put an abstraction layer over the data files it reads.
What you could do is use something like Hadoop streaming, or write your own InputFormat that somehow forwards the data to the reducers. However, that will not be straightforward.
https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F
So, short answer: while possible, it is not easy. Sorry.
Created 03-23-2016 08:14 PM
Thanks Benjamin
Created 03-24-2016 11:51 AM
If you want to investigate this more, there is a Hadoop streaming example in the book Hadoop: The Definitive Guide which might be of help. (They get a list of files, then spin off reducers based on the files and run some Linux commands in them. You could essentially do anything you want.)
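To make that concrete, here is a rough, untested sketch of that pattern for the compression direction: build a list of the original files, let NLineInputFormat hand each map task one file name, and have a small shell mapper compress that one file under its own name (decompression is the same idea with bzip2 -dc and the .bz2 suffix stripped). The script name, paths, and streaming jar location are made up for illustration.
#!/usr/bin/env bash
# compress_one.sh (hypothetical name): with NLineInputFormat each map task
# receives one line of the form "<offset><TAB><hdfs path of one input file>".
read offset src
# Stream the file through bzip2 and write it back under its original name.
hadoop fs -cat "$src" | bzip2 -c | hadoop fs -put - /my/hdfs/path_compressed/$(basename "$src").bz2

# Driver (run from an edge node): one map task per file, no reducers.
hadoop fs -ls /my/hdfs/path/ | awk '{print $NF}' | grep '\.dat$' > files.txt
hadoop fs -put files.txt /my/hdfs/files.txt
hadoop jar /path/to/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -input /my/hdfs/files.txt \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -output /my/hdfs/compress_job_log \
  -mapper compress_one.sh \
  -file compress_one.sh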
