<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Pig LZO Inputsplits in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5587#M983</link>
    <description>&lt;P class="p1"&gt;I've sqooped a fairly large table into a CDH4.5 cluster. To save space and still have splittable files, I used LZO compression and set up LZO as per the Cloudera instructions [1].&lt;/P&gt;&lt;P class="p1"&gt;Sqoop makes a single large LZO file (+index) for the initial import and adds smaller LZO+index files for the subsequent incremental Sqoop imports.&lt;/P&gt;&lt;P class="p2"&gt;My problem is that Pig doesn't seem to split the LZO files when loading them, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, which matches the number of LZO files in the HDFS dir. 6 complete quickly (the incremental files) but 1 takes a very long time. I can't verify it, but that has to be the one large file (90+ GB).&lt;/P&gt;&lt;P class="p2"&gt;At the same time, Hive can run a query on the exact same table and get 700+ mappers, no problem. So the LZO files are splittable, but Pig does not seem to split them.&lt;/P&gt;&lt;P class="p2"&gt;LZO seems to be enabled in the Pig script, since I'm getting these messages:&lt;/P&gt;&lt;P class="p2"&gt;INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library&lt;/P&gt;&lt;P class="p1"&gt;INFO com.hadoop.compression.lzo.lzoCodec - Successfully loaded &amp;amp; initialized native-lzo library [hadoop-lzo rev null]&lt;/P&gt;&lt;P class="p1"&gt;It seems like Pig can read the LZO files but does not read the accompanying index files to determine split points.&lt;/P&gt;&lt;P class="p1"&gt;Any suggestions?&lt;/P&gt;&lt;P class="p1"&gt;[1] &lt;A target="_blank" href="https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html"&gt;https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 08:53:26 GMT</pubDate>
    <dc:creator>RobV</dc:creator>
    <dc:date>2022-09-16T08:53:26Z</dc:date>
    <item>
      <title>Pig LZO Inputsplits</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5587#M983</link>
      <description>&lt;P class="p1"&gt;I've sqooped a fairly large table into a CDH4.5 cluster. To save space and still have splittable files, I used LZO compression and set up LZO as per the Cloudera instructions [1].&lt;/P&gt;&lt;P class="p1"&gt;Sqoop makes a single large LZO file (+index) for the initial import and adds smaller LZO+index files for the subsequent incremental Sqoop imports.&lt;/P&gt;&lt;P class="p2"&gt;My problem is that Pig doesn't seem to split the LZO files when loading them, while Hive does so perfectly. Running a simple Pig script gives me exactly 7 mappers, which matches the number of LZO files in the HDFS dir. 6 complete quickly (the incremental files) but 1 takes a very long time. I can't verify it, but that has to be the one large file (90+ GB).&lt;/P&gt;&lt;P class="p2"&gt;At the same time, Hive can run a query on the exact same table and get 700+ mappers, no problem. So the LZO files are splittable, but Pig does not seem to split them.&lt;/P&gt;&lt;P class="p2"&gt;LZO seems to be enabled in the Pig script, since I'm getting these messages:&lt;/P&gt;&lt;P class="p2"&gt;INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - loaded native gpl library&lt;/P&gt;&lt;P class="p1"&gt;INFO com.hadoop.compression.lzo.lzoCodec - Successfully loaded &amp;amp; initialized native-lzo library [hadoop-lzo rev null]&lt;/P&gt;&lt;P class="p1"&gt;It seems like Pig can read the LZO files but does not read the accompanying index files to determine split points.&lt;/P&gt;&lt;P class="p1"&gt;Any suggestions?&lt;/P&gt;&lt;P class="p1"&gt;[1] &lt;A target="_blank" href="https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html"&gt;https://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 08:53:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5587#M983</guid>
      <dc:creator>RobV</dc:creator>
      <dc:date>2022-09-16T08:53:26Z</dc:date>
    </item>
    <item>
      <title>Re: Pig LZO Inputsplits</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5589#M984</link>
      <description>Pig's default PigStorage loader may not understand how to use the index files created alongside. You'll need to use the ElephantBird loader functions available at &lt;A target="_blank" href="https://github.com/kevinweil/elephant-bird"&gt;https://github.com/kevinweil/elephant-bird&lt;/A&gt; to properly load them in a scalable way (you need its com.twitter.elephantbird.pig.load.LzoTextLoader loader specifically, for indexed LZO text files).</description>
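      <!--
      A rough sketch of why the index matters for split planning, in Python
      rather than Pig. It assumes the .index file is a flat sequence of 8-byte
      big-endian offsets, one per compressed block, which is my understanding
      of the hadoop-lzo format and not something stated in this thread. The
      names parse_index and plan_splits are hypothetical helpers, not part of
      Pig, hadoop-lzo, or Elephant Bird.

```python
import struct
from bisect import bisect_left


def parse_index(raw):
    """Decode an index assumed to hold consecutive 8-byte big-endian offsets."""
    count = len(raw) // 8
    return list(struct.unpack(f">{count}q", raw[:count * 8]))


def plan_splits(file_size, block_offsets, split_size):
    """Return (start, end) byte ranges for one compressed file.

    Without block offsets the file cannot be cut safely, so it becomes a
    single split (one mapper). With offsets, each nominal boundary at a
    multiple of split_size is moved forward to the next block start.
    """
    if not block_offsets:
        return [(0, file_size)]

    boundaries = []
    nominal = split_size
    while nominal < file_size:
        i = bisect_left(block_offsets, nominal)
        if i == len(block_offsets):
            break  # no block starts at or after this nominal boundary
        snapped = block_offsets[i]
        if not boundaries or snapped > boundaries[-1]:
            boundaries.append(snapped)
        nominal += split_size

    edges = [0] + boundaries + [file_size]
    return list(zip(edges[:-1], edges[1:]))
```

      With an empty offset list, plan_splits returns one range covering the
      whole file, which matches the one-mapper-per-file behaviour described
      in the question. An index-aware loader can return many ranges for the
      same file.
      -->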
      <pubDate>Tue, 04 Feb 2014 13:51:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5589#M984</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2014-02-04T13:51:04Z</dc:date>
    </item>
    <item>
      <title>Re: Pig LZO Inputsplits</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5597#M985</link>
      <description>&lt;P&gt;Is there another way to sqoop data into a compressed container format (any) and have Hive and Pig understand its splits?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;AFAIK:&lt;/P&gt;&lt;P&gt;- Sqooping with Snappy will not result in splittable files&lt;/P&gt;&lt;P&gt;- Sqooping to Hive and using --as-avrofile + Snappy is not compatible&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Aside from the EB libs, the only way I see is not using compression. Is this correct?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Feb 2014 14:52:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5597#M985</guid>
      <dc:creator>RobV</dc:creator>
      <dc:date>2014-02-04T14:52:32Z</dc:date>
    </item>
    <item>
      <title>Re: Pig LZO Inputsplits</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5671#M986</link>
      <description>&lt;P&gt;Thanks for the hint. To be complete:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Data sqooped to Hive uses '\u0001' as a field delimiter. LzoTextLoader does not support a custom delimiter; use LzoTokenizedLoader for that. Works like a charm after that ;)&lt;/P&gt;</description>
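      <!--
      A small Python illustration of the delimiter point, not Pig syntax. The
      sample row and the split_fields helper are made up for the example. It
      only shows that a line delimited by '\u0001' (Ctrl-A) stays one unsplit
      field when split on a different delimiter such as a tab.

```python
CTRL_A = "\x01"  # the same character as '\u0001'


def split_fields(line, delimiter=CTRL_A):
    """Split one text record into fields on the given delimiter."""
    return line.rstrip("\n").split(delimiter)


# Hypothetical record, shaped like a row delimited by Ctrl-A.
row = "42" + CTRL_A + "alice" + CTRL_A + "2014-02-05\n"

print(split_fields(row))        # three separate fields
print(split_fields(row, "\t"))  # one field: the tab never matches
```

      This is why the loader needs to be told the delimiter explicitly, as
      described above for LzoTokenizedLoader.
      -->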
      <pubDate>Wed, 05 Feb 2014 09:33:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pig-LZO-Inputsplits/m-p/5671#M986</guid>
      <dc:creator>RobV</dc:creator>
      <dc:date>2014-02-05T09:33:06Z</dc:date>
    </item>
  </channel>
</rss>

