<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Spark physical plan not generating correct SQL DAG in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/spark-physical-plan-not-generating-correct-sql-DAG/m-p/289480#M214286</link>
    <description>Question: a simple Spark SQL join between two Hive tables on HDP 3.1.4 returns incorrect data; operators on the right side of the physical plan have no evaluation order ID. Full post and resolution in the items below.</description>
    <pubDate>Mon, 10 Feb 2020 21:23:43 GMT</pubDate>
    <dc:creator>Ryanp</dc:creator>
    <dc:date>2020-02-10T21:23:43Z</dc:date>
    <item>
      <title>Spark physical plan not generating correct SQL DAG</title>
      <link>https://community.cloudera.com/t5/Support-Questions/spark-physical-plan-not-generating-correct-sql-DAG/m-p/289480#M214286</link>
      <description>&lt;P&gt;Hi community,&lt;/P&gt;&lt;P&gt;I am trying to debug a simple Spark SQL query that is returning incorrect data.&lt;/P&gt;&lt;P&gt;In this instance the query is a simple join between two Hive tables. The issue seems tied to the fact that the physical plan Spark has generated (with Catalyst optimization) looks to be corrupted: some of the steps in the physical plan have not been assigned an evaluation order ID, so all evaluation on the right side of the join is never completed in the Spark query.&lt;/P&gt;&lt;P&gt;This error occurs on an HDP 3.1.4 cluster running:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;&amp;gt;&amp;gt;&amp;gt; sc.version
u'2.3.2.3.1.4.0-315'&lt;/LI-CODE&gt;&lt;P&gt;Here is the example query:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Build a Hive Warehouse Connector session for reading Hive managed tables
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

# Two filtered reads pushed down through HWC
filter_1 = hive.executeQuery('select * from 03_score where scores = 5 or scores = 6')
filter_2 = hive.executeQuery('select * from 03_score where scores = 8')

# Inner join of the two reads on encntr_id
joined_df = filter_1.alias('o').join(filter_2.alias('po'), filter_1.encntr_id == filter_2.encntr_id, how='inner')
joined_df.count()  # shows incorrect value
joined_df.explain(True)&lt;/LI-CODE&gt;&lt;P&gt;And here is the output of the plan evaluation:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;== Physical Plan ==
 SortMergeJoin [encntr_id#0], [encntr_id#12], Inner
:- *(2) Sort [encntr_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(encntr_id#0, 200)
:     +- *(1) Filter isnotnull(encntr_id#0)
:        +- *(1) DataSourceV2Scan [encntr_id#0, scores_datetime#1, scores#2], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader@a6df563
+-  Sort [encntr_id#12 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(encntr_id#12, 200)
      +-  Filter isnotnull(encntr_id#12)
         +- DataSourceV2Scan [encntr_id#12, dateofbirth#13, postcode#14, event_desc#15, event_performed_dt_tm#16], com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceReader@60dd22d9&lt;/LI-CODE&gt;&lt;P&gt;Note that the DataSourceV2Scan, Filter, Exchange, and Sort steps on the right side of the join have not been assigned an order ID and therefore appear never to be computed.&lt;/P&gt;&lt;P&gt;Can anyone shed some light on this issue for me? Why would a physical plan that looks correct not be assigned an evaluation order ID?&lt;/P&gt;
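&lt;P&gt;For reference, the *(n) prefixes in the plan are whole-stage code generation stage IDs. Below is a minimal sketch of one way to test whether the missing IDs actually change the result, assuming the same session and DataFrames as above; spark.sql.codegen.wholeStage is a standard Spark SQL flag:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Sketch: re-plan the join with whole-stage codegen disabled and compare.
# Assumes filter_1 and filter_2 from the snippet above are still in scope.
spark.conf.set('spark.sql.codegen.wholeStage', 'false')

# Building a fresh DataFrame forces physical planning under the new setting.
rejoined = filter_1.join(filter_2, filter_1.encntr_id == filter_2.encntr_id, how='inner')
rejoined.explain(True)   # no operator should carry a *(n) prefix now
print(rejoined.count())  # compare against the earlier count

spark.conf.set('spark.sql.codegen.wholeStage', 'true')  # restore the default&lt;/LI-CODE&gt;&lt;P&gt;If the counts match with codegen on and off, the missing *(n) markers are cosmetic rather than the cause of the wrong result.&lt;/P&gt;</description>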
      <pubDate>Mon, 10 Feb 2020 21:23:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/spark-physical-plan-not-generating-correct-sql-DAG/m-p/289480#M214286</guid>
      <dc:creator>Ryanp</dc:creator>
      <dc:date>2020-02-10T21:23:43Z</dc:date>
    </item>
    <item>
      <title>Re: Spark physical plan not generating correct SQL DAG</title>
      <link>https://community.cloudera.com/t5/Support-Questions/spark-physical-plan-not-generating-correct-sql-DAG/m-p/289618#M214369</link>
      <description>&lt;P&gt;Figured this out internally.&lt;/P&gt;&lt;P&gt;It turns out the Spark optimization routine can be affected by the configuration setting&lt;/P&gt;&lt;P&gt;spark.sql.codegen.maxFields&lt;/P&gt;&lt;P&gt;which has implications for how Spark will optimize the read from 'fat' (wide) tables.&lt;/P&gt;&lt;P&gt;In my case the setting was low, which meant the DAG stages reading the right side of the join (the "fat" table) were performed without being assigned to a whole-stage codegen stage.&lt;/P&gt;&lt;P&gt;It is important to note that the read of the Hive data in either case returned the same results, just with a different optimization applied to the physical plan.&lt;/P&gt;
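&lt;P&gt;A minimal sketch of how one might check and raise this limit for the session, assuming the filter_1/filter_2 DataFrames from the original post; the value 200 is illustrative and just needs to exceed the fat table's field count:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Sketch: inspect and raise the whole-stage codegen field limit.
# spark.sql.codegen.maxFields caps how many fields an operator may produce
# before Spark declines to compile it into a whole-stage codegen stage.
print(spark.conf.get('spark.sql.codegen.maxFields'))  # effective limit

spark.conf.set('spark.sql.codegen.maxFields', '200')  # illustrative value

# A freshly built DataFrame is planned under the new setting.
rejoined = filter_1.join(filter_2, filter_1.encntr_id == filter_2.encntr_id, how='inner')
rejoined.explain(True)  # right-side operators should now carry *(n) prefixes&lt;/LI-CODE&gt;&lt;P&gt;With the limit above the wide table's field count, the right-side scan, filter, exchange, and sort are again compiled into whole-stage codegen stages.&lt;/P&gt;</description>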
      <pubDate>Wed, 12 Feb 2020 10:25:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/spark-physical-plan-not-generating-correct-sql-DAG/m-p/289618#M214369</guid>
      <dc:creator>Ryanp</dc:creator>
      <dc:date>2020-02-12T10:25:29Z</dc:date>
    </item>
  </channel>
</rss>