<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Issue of copying data from kudu to hdfs using spark sql in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282875#M210257</link>
    <description>&lt;P&gt;Can you try explicitly casting the string value to a timestamp?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I don't think Spark will push down the timestamp predicate if the value it is compared against is a string. This is tracked in &lt;A href="https://issues.apache.org/jira/browse/KUDU-2821" target="_blank"&gt;https://issues.apache.org/jira/browse/KUDU-2821&lt;/A&gt;.&lt;/P&gt;</description>
    <pubDate>Wed, 13 Nov 2019 14:17:34 GMT</pubDate>
    <dc:creator>Grant Henke</dc:creator>
    <dc:date>2019-11-13T14:17:34Z</dc:date>
    <item>
      <title>Issue of copying data from kudu to hdfs using spark sql</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282814#M210210</link>
      <description>&lt;P&gt;I have a Kudu table with the following schema:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;create table test_table
(
    `time` timestamp not null, --
    `id` string not null, --
    .....
    
primary key(`time`,`id`)
)
partition by hash(id) partitions 6
stored as kudu;&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to use Spark to copy the data to a Parquet table in HDFS:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="java"&gt; val df = spark.read.options(Map("kudu.master" -&amp;gt; kuduMasters,
        "kudu.table" -&amp;gt; KuduTable)).format("kudu").load
        .where("time &gt; '2019-10-29 08:05:10' AND time &lt; '2019-10-29 08:05:30'")
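        // Sketch added for illustration (not part of the original post): the same filter with
        // the string literals cast explicitly to TIMESTAMP. It assumes that comparing the
        // timestamp column against plain strings is what keeps the predicate from reaching Kudu;
        // the scan row count in the Spark UI can be used to check whether it helps.
        // .where("time &gt; CAST('2019-10-29 08:05:10' AS TIMESTAMP) AND time &lt; CAST('2019-10-29 08:05:30' AS TIMESTAMP)")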

 df.write
        .mode("append")
        .parquet("hdfs://parquet")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But performance is poor, and the job appears to do a full table scan of the Kudu table: in the Spark UI, the number shown for "&lt;SPAN&gt;Scan Kudu impala::table&lt;/SPAN&gt;" matches the size of the entire table.&lt;BR /&gt;For comparison, I did the same copy with Impala's "insert into ... select ... from", which is much faster, and there the "where" predicate seems to take effect.&lt;BR /&gt;Is this full table scan expected, or am I missing something? The Kudu version is 1.10.0 and the Spark client is kudu-spark2_2.11:1.10.0.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Nov 2019 02:39:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282814#M210210</guid>
      <dc:creator>drake4</dc:creator>
      <dc:date>2019-11-13T02:39:11Z</dc:date>
    </item>
    <item>
      <title>Re: Issue of copying data from kudu to hdfs using spark sql</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282875#M210257</link>
      <description>&lt;P&gt;Can you try explicitly casting the string value to a timestamp?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I don't think Spark will push down the timestamp predicate if the value it is compared against is a string. This is tracked in &lt;A href="https://issues.apache.org/jira/browse/KUDU-2821" target="_blank"&gt;https://issues.apache.org/jira/browse/KUDU-2821&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Nov 2019 14:17:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282875#M210257</guid>
      <dc:creator>Grant Henke</dc:creator>
      <dc:date>2019-11-13T14:17:34Z</dc:date>
    </item>
    <item>
      <title>Re: Issue of copying data from kudu to hdfs using spark sql</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282918#M210289</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/12479"&gt;@Grant Henke&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The timestamp predicate works after I cast the string value to a timestamp. Thank you for your help!&lt;/P&gt;</description>
      <pubDate>Thu, 14 Nov 2019 00:48:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Issue-of-copying-data-from-kudu-to-hdfs-using-spark-sql/m-p/282918#M210289</guid>
      <dc:creator>drake4</dc:creator>
      <dc:date>2019-11-14T00:48:45Z</dc:date>
    </item>
  </channel>
</rss>

