<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Options for decompressing HDFS data using Pig in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141134#M23606</link>
    <description>&lt;P&gt;If I use a Pig script like the one below, I am able to leverage MapReduce to compress a ton of data, and I get a pretty good ratio.&lt;/P&gt;&lt;P&gt;However, when I try to decompress the data, I lose the individual files.&lt;/P&gt;&lt;P&gt;For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.&lt;/P&gt;&lt;P&gt;That's fine.&lt;/P&gt;&lt;P&gt;But then, when I try to do the same thing in reverse to get back my original content, I just get larger files named part-m-00001, part-m-00002, etc.&lt;/P&gt;&lt;P&gt;Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files - including the file names?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.enabled true;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 23 Mar 2016 23:34:59 GMT</pubDate>
    <dc:creator>zack_riesland</dc:creator>
    <dc:date>2016-03-23T23:34:59Z</dc:date>
    <item>
      <title>Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141134#M23606</link>
      <description>&lt;P&gt;If I use a Pig script like the one below, I am able to leverage MapReduce to compress a ton of data, and I get a pretty good ratio.&lt;/P&gt;&lt;P&gt;However, when I try to decompress the data, I lose the individual files.&lt;/P&gt;&lt;P&gt;For example, if my original, uncompressed folder has a.dat through z.dat, the compressed folder will have something like part-m-00001.bz2, part-m-00002.bz2, etc.&lt;/P&gt;&lt;P&gt;That's fine.&lt;/P&gt;&lt;P&gt;But then, when I try to do the same thing in reverse to get back my original content, I just get larger files named part-m-00001, part-m-00002, etc.&lt;/P&gt;&lt;P&gt;Is there a way to leverage our cluster to compress HDFS files in such a way that I can get back the original files - including the file names?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.enabled true;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;InputFiles = LOAD '/my/hdfs/path/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;STORE InputFiles INTO '/my/hdfs/path_compressed/' USING PigStorage();&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Mar 2016 23:34:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141134#M23606</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2016-03-23T23:34:59Z</dc:date>
    </item>
    <item>
      <title>Re: Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141135#M23607</link>
      <description>&lt;P&gt;Not easily. MapReduce, by design, groups files together as it pleases and then writes one output file per mapper/reducer. Those are the part files.&lt;/P&gt;&lt;P&gt;Pig will not accommodate what you want; the whole stack is designed to put an abstraction layer over the data files it reads.&lt;/P&gt;&lt;P&gt;What you could do is use something like Hadoop streaming, or write your own InputFormat that somehow forwards the data to the reducers. However, that will not be straightforward.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F" target="_blank"&gt;https://hadoop.apache.org/docs/r1.2.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F&lt;/A&gt;&lt;/P&gt;&lt;P&gt;So, short answer: while possible, it is not easy. Sorry.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Mar 2016 23:39:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141135#M23607</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-03-23T23:39:54Z</dc:date>
    </item>
    <item>
      <title>Re: Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141136#M23608</link>
      <description>&lt;P&gt;Thanks Benjamin&lt;/P&gt;</description>
      <pubDate>Thu, 24 Mar 2016 03:14:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141136#M23608</guid>
      <dc:creator>zack_riesland</dc:creator>
      <dc:date>2016-03-24T03:14:49Z</dc:date>
    </item>
    <item>
      <title>Re: Options for decompressing HDFS data using Pig</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141137#M23609</link>
      <description>&lt;P&gt;If you want to investigate this further, there is a Hadoop streaming example in the book "Hadoop: The Definitive Guide" that might be of help. (They get a list of files, then spin off reducers based on the files and run some Linux commands in the reducers. You could essentially do anything you want.)&lt;/P&gt;</description>
      <pubDate>Thu, 24 Mar 2016 18:51:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Options-for-decompressing-HDFS-data-using-Pig/m-p/141137#M23609</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-03-24T18:51:03Z</dc:date>
    </item>
  </channel>
</rss>

