<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97110#M60359</link>
    <description>&lt;P&gt;I have a couple of questions about file compression. We plan on using the ORC format for a data zone that will be heavily accessed by end users via Hive/JDBC.&lt;/P&gt;&lt;P&gt;What is the recommendation when it comes to &lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-HiveQLSyntax"&gt;compressing&lt;/A&gt; ORC files?&lt;/P&gt;&lt;P&gt;Do you think Snappy is a better option (over ZLIB) given Snappy’s better read performance? (Snappy is more performant in a read-often scenario, which is usually the case for Hive data.) When would you choose ZLIB?&lt;/P&gt;&lt;P&gt;
As a side note: compression is a double-edged sword, as you can also run into performance issues when going from larger files spread among multiple nodes to smaller files and their interactions with the HDFS block size. You can blunt this by using the &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.compression.strategy"&gt;compression strategy&lt;/A&gt; setting.&lt;/P&gt;</description>
    <pubDate>Tue, 17 Nov 2015 04:32:39 GMT</pubDate>
    <dc:creator>amcbarnett</dc:creator>
    <dc:date>2015-11-17T04:32:39Z</dc:date>
    <item>
      <title>Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97110#M60359</link>
      <description>&lt;P&gt;I have a couple of questions about file compression. We plan on using the ORC format for a data zone that will be heavily accessed by end users via Hive/JDBC.&lt;/P&gt;&lt;P&gt;What is the recommendation when it comes to &lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-HiveQLSyntax"&gt;compressing&lt;/A&gt; ORC files?&lt;/P&gt;&lt;P&gt;Do you think Snappy is a better option (over ZLIB) given Snappy’s better read performance? (Snappy is more performant in a read-often scenario, which is usually the case for Hive data.) When would you choose ZLIB?&lt;/P&gt;&lt;P&gt;
As a side note: compression is a double-edged sword, as you can also run into performance issues when going from larger files spread among multiple nodes to smaller files and their interactions with the HDFS block size. You can blunt this by using the &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.compression.strategy"&gt;compression strategy&lt;/A&gt; setting.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Nov 2015 04:32:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97110#M60359</guid>
      <dc:creator>amcbarnett</dc:creator>
      <dc:date>2015-11-17T04:32:39Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97111#M60360</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/369/amcbarnett.html" nodeid="369" target="_blank"&gt;@Ancil McBarnett&lt;/A&gt; Performance! Performance! and performance! &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;ORC + Zlib is the way to go.&lt;/P&gt;&lt;P&gt;Here are the details, based on a test done in my environment.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;run 1 vs. run 2 &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="481-screen-shot-2015-11-16-at-34624-pm.png" style="width: 1322px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/23983i4207171FC228323F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="481-screen-shot-2015-11-16-at-34624-pm.png" alt="481-screen-shot-2015-11-16-at-34624-pm.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 12:50:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97111#M60360</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2019-08-19T12:50:00Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97112#M60361</link>
      <description>&lt;P&gt;ORC+ZLib seems to have the better performance. ZLib is also the default compression option; however, there are definitely valid cases for Snappy.&lt;/P&gt;&lt;P&gt;I like the comment from David (&lt;EM&gt;2014, before the ZLib update&lt;/EM&gt;): &lt;EM&gt;"SNAPPY for time based performance, ZLIB for resource performance (Drive Space)." &lt;/EM&gt;Make sure you check out David's post:  &lt;A target="_blank" href="https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance"&gt;https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As &lt;A rel="user" href="https://community.cloudera.com/users/301/gopal.html" nodeid="301"&gt;@gopal&lt;/A&gt; pointed out in the comments, we have switched to a &lt;EM&gt;&lt;STRONG&gt;new ZLib algorithm&lt;/STRONG&gt;&lt;/EM&gt;, hence the combination ORC + (new) ZLib is the way to go. The performance difference between ZLib and Snappy for disk writes is rather small.&lt;/P&gt;&lt;P&gt;Btw, ZLib is not always the better option; when it comes to HBase, Snappy is usually better &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Nov 2015 05:15:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97112#M60361</guid>
      <dc:creator>jstraub</dc:creator>
      <dc:date>2015-11-17T05:15:50Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97113#M60362</link>
      <description>&lt;P&gt;Thanks for sharing! How many datasets were in the Links table? Is the dataset in Links a subset from the ABC dataset?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Nov 2015 05:18:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97113#M60362</guid>
      <dc:creator>jstraub</dc:creator>
      <dc:date>2015-11-17T05:18:40Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97114#M60363</link>
      <description>&lt;P&gt;ABC and Links were separate tables.  &lt;A rel="user" href="https://community.cloudera.com/users/113/jstraub.html" nodeid="113"&gt;@Jonas Straub&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Nov 2015 07:26:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97114#M60363</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-11-17T07:26:59Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97115#M60364</link>
      <description>&lt;P&gt;David's post is from 2014. Since then we have switched away from standard Zlib in ORC.&lt;/P&gt;&lt;P&gt;
See the slides from &lt;A href="http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller/14"&gt;ORC 2015: Faster, Better, Smaller&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Each column type (string, int, etc.) gets a different Zlib-compatible algorithm for compression (i.e., different trade-offs of RLE/Huffman/LZ77).&lt;/P&gt;&lt;P&gt;After the columnar improvements, ORC+Zlib no longer has the historic weaknesses of Zlib, so it is faster than SNAPPY to read, smaller than SNAPPY on disk, and only ~10% slower than SNAPPY to write.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Nov 2015 14:00:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97115#M60364</guid>
      <dc:creator>gopalv</dc:creator>
      <dc:date>2015-11-18T14:00:12Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97116#M60365</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/301/gopal.html" nodeid="301"&gt;@gopal&lt;/A&gt;. In this case we should definitely use ORC+(new)Zlib. I'll edit my answer &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Nov 2015 14:03:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97116#M60365</guid>
      <dc:creator>jstraub</dc:creator>
      <dc:date>2015-11-18T14:03:47Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97117#M60366</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/301/gopal.html" nodeid="301"&gt;@gopal&lt;/A&gt; just to confirm, these improvements would require HDP 2.3.x or later, correct?&lt;/P&gt;</description>
      <pubDate>Wed, 25 Nov 2015 00:42:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97117#M60366</guid>
      <dc:creator>tbenton</dc:creator>
      <dc:date>2015-11-25T00:42:38Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97118#M60367</link>
      <description>&lt;P&gt;Any updates for 2016?&lt;/P&gt;</description>
      <pubDate>Sat, 04 Jun 2016 12:07:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97118#M60367</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2016-06-04T12:07:02Z</dc:date>
    </item>
    <item>
      <title>Re: Snappy vs. Zlib - Pros and Cons for each compression in Hive/ Orc files</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97119#M60368</link>
      <description>&lt;P&gt;ORC is considering adding a faster decompression option in 2016: zstd (&lt;A href="https://github.com/Cyan4973/zstd/blob/master/README.md"&gt;ZStandard&lt;/A&gt;). The enum value for it has already been reserved, but we still have to work through the trade-offs involved in ZStd; more on that sometime later this year.&lt;/P&gt;&lt;P&gt;
&lt;A href="https://issues.apache.org/jira/browse/ORC-46"&gt;https://issues.apache.org/jira/browse/ORC-46&lt;/A&gt;&lt;/P&gt;&lt;P&gt;But bigger wins are in motion for ORC with LLAP: the in-memory format for LLAP isn't compressed at all, so it performs like ORC without compression overheads, while letting the cold data on disk sit around in Zlib.&lt;/P&gt;</description>
      <pubDate>Sat, 04 Jun 2016 12:34:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/m-p/97119#M60368</guid>
      <dc:creator>gopalv</dc:creator>
      <dc:date>2016-06-04T12:34:59Z</dc:date>
    </item>
  </channel>
</rss>

