Issue of copying data from kudu to hdfs using spark sql

drake4 — Wed, 13 Nov 2019 02:39:11 GMT

I have a kudu table with schema:

create table test_table (
  `time` timestamp not null, --
  `id` string not null, --
  .....
  primary key(`time`,`id`)
)
partition by hash(id) partitions 6
stored as kudu;

and I try to use spark to copy the data to a parquet table in hdfs:

val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load
  .where("time > '2019-10-29 08:05:10' AND time < '2019-10-29 08:05:30'")

df.write
  .mode("append")
  .parquet("hdfs://parquet")

But performance is poor, and the job seems to be doing a full table scan against the kudu table (in the Spark UI, the number shown for "Scan Kudu impala::table" is that of the entire table).
For comparison I did a copy using impala's "insert into from" which is much faster and the "where" predicate seems to be working.
Is this full table scan behavior expected or am I missing something here? The kudu version is 1.10.0 and spark client is kudu-spark2_2.11:1.10.0

Re: Issue of copying data from kudu to hdfs using spark sql

Grant Henke — Wed, 13 Nov 2019 14:17:34 GMT

Can you try explicitly casting the string value to a timestamp?

I don't think Spark will push down the timestamp predicate if it's a string. This is tracked in https://issues.apache.org/jira/browse/KUDU-2821.
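For readers following along, here is a minimal sketch of what such a cast might look like. It is not necessarily the exact code used in this thread. It assumes a spark-shell style session where `spark`, `kuduMasters` and `KuduTable` are already defined as in the original post; the names `kuduDf`, `castInSql` and `typedLiterals` are made up for illustration.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit}

// Assumption: spark, kuduMasters and KuduTable are in scope, as in the original post.
val kuduDf = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load

// Option 1: cast the string literals to timestamps inside the SQL expression.
val castInSql = kuduDf.where(
  "time > CAST('2019-10-29 08:05:10' AS TIMESTAMP) " +
  "AND time < CAST('2019-10-29 08:05:30' AS TIMESTAMP)")

// Option 2: use the Column API with typed timestamp literals instead of strings.
val typedLiterals = kuduDf.where(
  col("time") > lit(Timestamp.valueOf("2019-10-29 08:05:10")) &&
  col("time") < lit(Timestamp.valueOf("2019-10-29 08:05:30")))

// Print the physical plan to check whether the time filters reach the Kudu scan.
castInSql.explain()

// Write the filtered rows to Parquet, as in the original job.
castInSql.write
  .mode("append")
  .parquet("hdfs://parquet")
```

Either form compares the `time` column against a timestamp value rather than a string, which is the condition described above for the predicate to be pushed down. Checking the plan printed by `explain()` before running the full copy is a cheap way to confirm it.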

Re: Issue of copying data from kudu to hdfs using spark sql

drake4 — Thu, 14 Nov 2019 00:48:45 GMT

Hi @Grant Henke

The timestamp predicate works after I cast it to timestamp, thank you for your help!
