<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark SQL: Limit clause performance issues in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197828#M83382</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/17662/dmueller1607.html" nodeid="17662"&gt;@Daniel Müller&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Could you share the output of EXPLAIN EXTENDED for the above query? From the logical/physical plan details you can see whether filter pushdown includes the limit. If this is Spark with LLAP integration, note that this was not supported prior to HDP 3.0. Starting with HDP 3.0 we have added the HWC (Hive Warehouse Connector) for Spark, which works as expected.&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;*** If you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.&lt;/P&gt;</description>
    <pubDate>Tue, 11 Sep 2018 19:18:21 GMT</pubDate>
    <dc:creator>falbani</dc:creator>
    <dc:date>2018-09-11T19:18:21Z</dc:date>
    <item>
      <title>Spark SQL: Limit clause performance issues</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197827#M83381</link>
      <description>&lt;P&gt;I have a huge Hive table (ORC) and I want to select just a few rows of it (in Zeppelin).&lt;/P&gt;&lt;PRE&gt;%spark&lt;BR /&gt;sqlContext.setConf("spark.sql.orc.filterPushdown", "true")&lt;BR /&gt;val df1 = sqlContext.sql("SELECT * FROM mydb.myhugetable LIMIT 1")	// Takes 10 mins&lt;BR /&gt;val df2 = sqlContext.sql("SELECT * FROM mydb.myhugetable").limit(1)	// Takes 10 mins&lt;/PRE&gt;&lt;P&gt;Using the &lt;STRONG&gt;LIMIT&lt;/STRONG&gt; clause in my SQL statement or the corresponding DataFrame method &lt;STRONG&gt;DF.limit&lt;/STRONG&gt; doesn't help, as the query still takes too long. It seems to read the whole table first and then return just the n rows.&lt;/P&gt;&lt;P&gt;How can I make the limit restrict the data read while the SQL is running, so the query runs faster?&lt;/P&gt;&lt;P&gt;Shouldn't the &lt;STRONG&gt;filter pushdown&lt;/STRONG&gt; help here? I can't see any difference whether &lt;STRONG&gt;spark.sql.orc.filterPushdown&lt;/STRONG&gt; is set to true or false. Thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 11 Sep 2018 19:01:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197827#M83381</guid>
      <dc:creator>dmueller1607</dc:creator>
      <dc:date>2018-09-11T19:01:17Z</dc:date>
    </item>
    <item>
      <title>Re: Spark SQL: Limit clause performance issues</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197828#M83382</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/17662/dmueller1607.html" nodeid="17662"&gt;@Daniel Müller&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Could you share the output of EXPLAIN EXTENDED for the above query? From the logical/physical plan details you can see whether filter pushdown includes the limit. If this is Spark with LLAP integration, note that this was not supported prior to HDP 3.0. Starting with HDP 3.0 we have added the HWC (Hive Warehouse Connector) for Spark, which works as expected.&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;*** If you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Sep 2018 19:18:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197828#M83382</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-09-11T19:18:21Z</dc:date>
    </item>
    <item>
      <title>Re: Spark SQL: Limit clause performance issues</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197829#M83383</link>
      <description>&lt;P&gt;Here is the extended explain of the DF; LLAP is not enabled in our cluster:&lt;/P&gt;&lt;PRE&gt;df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [importtime: timestamp, tester: string ... 21 more fields]
== Parsed Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
   +- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
      +- SubqueryAlias mytable
         +- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Analyzed Logical Plan ==
importtime: timestamp, tester: string, testerhead: string, matchcode: string, revision: string, usertext: string, teststufe: string, temp: string, filedat: date, starttime: timestamp, endtime: timestamp, lotnrc: string, wafernr: string, testname: string, testnumber: string, testtype: string, unit: string, highl: string, lowl: string, highs: string, lows: string, valuelist: string, hashvalue: int
GlobalLimit 1
+- LocalLimit 1
   +- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
      +- SubqueryAlias mytable
         +- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Optimized Logical Plan ==
InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- CollectLimit 1
      +- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct&amp;lt;importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
== Physical Plan ==
InMemoryTableScan [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
   +- InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- CollectLimit 1
            +- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct&amp;lt;importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
&lt;/PRE&gt;</description>
      <pubDate>Tue, 11 Sep 2018 21:06:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197829#M83383</guid>
      <dc:creator>dmueller1607</dc:creator>
      <dc:date>2018-09-11T21:06:00Z</dc:date>
    </item>
    <item>
      <title>Re: Spark SQL: Limit clause performance issues</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197830#M83384</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/17662/dmueller1607.html" nodeid="17662"&gt;@Daniel Müller&lt;/A&gt; As I thought, PushedFilters is empty. I checked the spark.sql.orc.filterPushdown implementation details, and it looks like LIMIT is not supported. You can read more in the inline comments here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala" target="_blank"&gt;https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala&lt;/A&gt;&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;*** If you found this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Sep 2018 20:44:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197830#M83384</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-09-12T20:44:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark SQL: Limit clause performance issues</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197831#M83385</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11048/falbani.html" nodeid="11048"&gt;@Felix Albani&lt;/A&gt; Thank you for your help! Without the LIMIT clause, the job works perfectly (and in parallel).&lt;/P&gt;</description>
      <pubDate>Fri, 14 Sep 2018 17:27:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-SQL-Limit-clause-performance-issues/m-p/197831#M83385</guid>
      <dc:creator>dmueller1607</dc:creator>
      <dc:date>2018-09-14T17:27:48Z</dc:date>
    </item>
  </channel>
</rss>

