<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark ORC Stripe Size in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189850#M151940</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15131/dhyun.html" nodeid="15131"&gt;@Dongjoon Hyun&lt;/A&gt; I just want to check whether the ORC library upgrade to ORC 1.4.1 is being included in the Spark 2.3 release. I have gone through the PRs under SPARK-20901, but I didn't find any discussion of the ORC library upgrade.&lt;/P&gt;</description>
    <pubDate>Wed, 31 Jan 2018 17:42:19 GMT</pubDate>
    <dc:creator>rajivchodisetti</dc:creator>
    <dc:date>2018-01-31T17:42:19Z</dc:date>
    <item>
      <title>Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189844#M151934</link>
      <description>&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;

    &lt;P&gt;We use Spark to flatten clickstream data and write it to S3 in 
ORC format with zlib compression. I have tried changing many Spark 
settings, but the stripes in the resulting ORC files are still very 
small (&amp;lt;2MB).&lt;/P&gt;

&lt;P&gt;Things I have tried so far to increase the stripe size:&lt;/P&gt;

&lt;P&gt;Earlier, each file was 20MB. Using coalesce, I now create 
files of 250-300MB, but there are still 200 stripes per file, 
i.e. each stripe is &amp;lt;2MB.&lt;/P&gt;

&lt;P&gt;I tried using HiveContext instead of SparkContext and setting 
hive.exec.orc.default.stripe.size to 67108864 (64MB), but Spark isn't 
honoring this parameter.&lt;/P&gt;

&lt;P&gt;So, any idea how I can increase the stripe size of the ORC files 
being created? The problem with small stripes is that when we query 
these ORC files with Presto and the stripe size is less than 8MB, 
Presto reads the whole data file instead of only the fields selected 
in the query.&lt;/P&gt;

&lt;P&gt;Presto Stripe issue related thread: &lt;A href="https://groups.google.com/forum/#!topic/presto-users/7NcrFvGpPaA"&gt;https://groups.google.com/forum/#!topic/presto-users/7NcrFvGpPaA&lt;/A&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Mon, 15 Jan 2018 15:01:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189844#M151934</guid>
      <dc:creator>rajivchodisetti</dc:creator>
      <dc:date>2018-01-15T15:01:57Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189845#M151935</link>
      <description>&lt;P&gt;Hi, &lt;A rel="user" href="https://community.cloudera.com/users/45423/rajivchodisetti54.html" nodeid="45423"&gt;@Rajiv Chodisetti&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;It's related to &lt;A href="https://issues.apache.org/jira/browse/HIVE-13232"&gt;HIVE-13232&lt;/A&gt; (fixed in Hive 1.3.0, 2.0.1, 2.1.0), but all Apache Spark releases still use the Hive 1.2.1 library.&lt;/P&gt;&lt;P&gt;Could you try HDP 2.6.3+ (2.6.4 is the latest)? HDP Spark 2.2 ships with the fixed Hive library.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2018 03:21:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189845#M151935</guid>
      <dc:creator>dhyun</dc:creator>
      <dc:date>2018-01-16T03:21:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189846#M151936</link>
      <description>&lt;P&gt;Thanks, Dongjoon, for the reply. But what about people who don't use HDP? Is there an open JIRA where someone is working on integrating the latest version of Hive with Spark? If you are aware of any such thread, can you please share the link?&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2018 14:32:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189846#M151936</guid>
      <dc:creator>rajivchodisetti</dc:creator>
      <dc:date>2018-01-16T14:32:39Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189847#M151937</link>
      <description>&lt;P&gt;If you can wait for it, Apache Spark 2.3 will be released with Apache ORC 1.4.1.&lt;/P&gt;&lt;P&gt;There are many ORC patches in Hive, and Apache Spark cannot sync with them promptly.&lt;/P&gt;&lt;P&gt;So, in Apache Spark, we decided to use the latest ORC 1.4.1 library instead of upgrading the Hive 1.2.1 library.&lt;/P&gt;&lt;P&gt;Starting with Apache Spark 2.3, Hive ORC tables are converted into ORC data source tables by default and read with the ORC 1.4.1 library.&lt;/P&gt;&lt;P&gt;That covers not only your issue but also vectorization on ORC.&lt;/P&gt;&lt;P&gt;Again, HDP 2.6.3+ already ships with ORC 1.4.1 and vectorization, too.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 00:54:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189847#M151937</guid>
      <dc:creator>dhyun</dc:creator>
      <dc:date>2018-01-17T00:54:08Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189848#M151938</link>
      <description>&lt;P&gt;As of now, Apache JIRA shows `Maintenance in progress`, so I cannot give you the exact link. The umbrella ORC JIRA is &lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-20901" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-20901&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 00:55:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189848#M151938</guid>
      <dc:creator>dhyun</dc:creator>
      <dc:date>2018-01-17T00:55:16Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189849#M151939</link>
      <description>&lt;P&gt;Thanks for the update. Vectorization support is another feature we have been waiting on for a long time.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jan 2018 02:49:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189849#M151939</guid>
      <dc:creator>rajivchodisetti</dc:creator>
      <dc:date>2018-01-17T02:49:39Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189850#M151940</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15131/dhyun.html" nodeid="15131"&gt;@Dongjoon Hyun&lt;/A&gt; I just want to check whether the ORC library upgrade to ORC 1.4.1 is being included in the Spark 2.3 release. I have gone through the PRs under SPARK-20901, but I didn't find any discussion of the ORC library upgrade.&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jan 2018 17:42:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189850#M151940</guid>
      <dc:creator>rajivchodisetti</dc:creator>
      <dc:date>2018-01-31T17:42:19Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189851#M151941</link>
      <description>&lt;P&gt;In SPARK-20901 `Feature Parity for ORC with Parquet`, you can see the
 issue links marked as `is blocked by`. Among them, the following issues
 cover the ORC library upgrade:
&lt;/P&gt;&lt;P&gt;- SPARK-21422 Depend on Apache ORC 1.4.0&lt;/P&gt;&lt;P&gt;- SPARK-22300 Update ORC to 1.4.1
&lt;/P&gt;&lt;P&gt;In addition, the following issue converts Hive ORC tables into Spark data source tables so that they use Apache ORC 1.4.1:&lt;/P&gt;&lt;P&gt;- SPARK-22279 Turn on spark.sql.hive.convertMetastoreOrc by default&lt;/P&gt;</description>
      <pubDate>Thu, 01 Feb 2018 01:08:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189851#M151941</guid>
      <dc:creator>dhyun</dc:creator>
      <dc:date>2018-02-01T01:08:58Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189852#M151942</link>
      <description>&lt;P&gt;I added a comment &lt;A href="https://community.hortonworks.com/answers/167350/view.html"&gt;above&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Feb 2018 01:10:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/189852#M151942</guid>
      <dc:creator>dhyun</dc:creator>
      <dc:date>2018-02-01T01:10:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark ORC Stripe Size</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/279624#M208368</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;I'm using Hive 2.3.4 and Spark 2.4.4 with Hadoop 2.8.5, but my PySpark code still ignores the stripe size parameter when creating ORC files. I have also posted a new question in this community:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://community.cloudera.com/t5/Support-Questions/Unable-to-set-stripe-size-for-the-orc-file-using-python/td-p/278918" target="_blank"&gt;https://community.cloudera.com/t5/Support-Questions/Unable-to-set-stripe-size-for-the-orc-file-using-python/td-p/278918&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Could you please advise on this?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;/P&gt;&lt;P&gt;Sai&lt;/P&gt;</description>
      <pubDate>Mon, 07 Oct 2019 23:47:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/m-p/279624#M208368</guid>
      <dc:creator>Desu</dc:creator>
      <dc:date>2019-10-07T23:47:10Z</dc:date>
    </item>
  </channel>
</rss>

