<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Looking for a better explanation for &quot;orc.row.index.stride&quot; property in ORC in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Looking-for-a-better-explanation-for-quot-orc-row-index/m-p/149160#M20258</link>
    <description>&lt;P&gt;So first:&lt;/P&gt;&lt;P&gt;ORC indexes come in two forms: the standard indexes, which are always created (min/max values for each stride for each column), and bloom filters.&lt;/P&gt;&lt;P&gt;Normal indexes are good for range queries and work amazingly well if the data is sorted. This is normally automatic on any date column or increasing columns like ids.&lt;/P&gt;&lt;P&gt;Bloom filters are great for equality queries on things like URLs, names, etc. in data that is not sorted (i.e. a customer name shows up only now and then in the data).&lt;/P&gt;&lt;P&gt;However, bloom filters take some time to compute, take some space in the indexes, and do not work well for most columns in a data warehouse (number fields like profit, sales, ...). So they are not created by default and need to be enabled per column:&lt;/P&gt;&lt;P&gt;orc.bloom.filter.columns&lt;/P&gt;&lt;P&gt;The stride size is the block of rows that the ORC reader can skip during a read operation based on these indexes. 10000 is normally a good number and increasing it doesn't help you much. You can play a bit with it but I doubt you will get big performance improvements by changing it. I would expect more impact from block size (which affects how many mappers are created) and compression (zlib is normally the best).&lt;/P&gt;&lt;P&gt;But by far the most impact comes from good data modeling, i.e. sorting the data during insert, the correct number of ORC files in the folder, the data types used, etc.&lt;/P&gt;&lt;P&gt;Shameless plug that explains it all a bit:&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data" target="_blank"&gt;http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 19 Feb 2016 17:29:28 GMT</pubDate>
    <dc:creator>bleonhardi</dc:creator>
    <dc:date>2016-02-19T17:29:28Z</dc:date>
    <item>
      <title>Looking for a better explanation for "orc.row.index.stride" property in ORC</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Looking-for-a-better-explanation-for-quot-orc-row-index/m-p/149159#M20257</link>
      <description>&lt;P&gt;The default value is set to 10,000 and should be &amp;gt; 100, as per the docs.&lt;/P&gt;&lt;P&gt;How should this value be changed or altered? Need some guidance.&lt;/P&gt;&lt;P&gt;If I have a large table with a billion rows, should we increase the value? Will this be affected by?&lt;/P&gt;&lt;P&gt;I am also assuming that "orc.bloom.filter.columns" is the list of columns on which the indexes will be created?&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2016 16:00:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Looking-for-a-better-explanation-for-quot-orc-row-index/m-p/149159#M20257</guid>
      <dc:creator>sdutta</dc:creator>
      <dc:date>2016-02-19T16:00:10Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for a better explanation for "orc.row.index.stride" property in ORC</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Looking-for-a-better-explanation-for-quot-orc-row-index/m-p/149160#M20258</link>
      <description>&lt;P&gt;So first:&lt;/P&gt;&lt;P&gt;ORC indexes come in two forms: the standard indexes, which are always created (min/max values for each stride for each column), and bloom filters.&lt;/P&gt;&lt;P&gt;Normal indexes are good for range queries and work amazingly well if the data is sorted. This is normally automatic on any date column or increasing columns like ids.&lt;/P&gt;&lt;P&gt;Bloom filters are great for equality queries on things like URLs, names, etc. in data that is not sorted (i.e. a customer name shows up only now and then in the data).&lt;/P&gt;&lt;P&gt;However, bloom filters take some time to compute, take some space in the indexes, and do not work well for most columns in a data warehouse (number fields like profit, sales, ...). So they are not created by default and need to be enabled per column:&lt;/P&gt;&lt;P&gt;orc.bloom.filter.columns&lt;/P&gt;&lt;P&gt;The stride size is the block of rows that the ORC reader can skip during a read operation based on these indexes. 10000 is normally a good number and increasing it doesn't help you much. You can play a bit with it but I doubt you will get big performance improvements by changing it. I would expect more impact from block size (which affects how many mappers are created) and compression (zlib is normally the best).&lt;/P&gt;&lt;P&gt;But by far the most impact comes from good data modeling, i.e. sorting the data during insert, the correct number of ORC files in the folder, the data types used, etc.&lt;/P&gt;&lt;P&gt;Shameless plug that explains it all a bit:&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data" target="_blank"&gt;http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2016 17:29:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Looking-for-a-better-explanation-for-quot-orc-row-index/m-p/149160#M20258</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-19T17:29:28Z</dc:date>
    </item>
  </channel>
</rss>

