Created 11-12-2019 06:39 PM
I have a Kudu table with the following schema:
create table test_table (
  `time` timestamp not null,
  `id` string not null,
  .....
  primary key (`time`, `id`)
)
partition by hash(id) partitions 6
stored as kudu;
I am trying to use Spark to copy the data to a Parquet table in HDFS:
val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load()
  .where("time > '2019-10-29 08:05:10' AND time < '2019-10-29 08:05:30'")

df.write
  .mode("append")
  .parquet("hdfs://parquet")
But performance is poor, and the job appears to do a full table scan against the Kudu table: in the Spark UI, the row count shown for the "Scan Kudu impala::table" stage equals the size of the entire table.
For comparison, I did the same copy with Impala's INSERT INTO ... SELECT, which is much faster, and there the WHERE predicate appears to be applied.
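The Impala statement was roughly the following (a sketch; parquet_table is a placeholder for the actual destination table):

-- parquet_table is a placeholder destination table name
INSERT INTO parquet_table
SELECT * FROM test_table
WHERE `time` > '2019-10-29 08:05:10' AND `time` < '2019-10-29 08:05:30';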
Is this full table scan behavior expected, or am I missing something here? The Kudu version is 1.10.0 and the Spark client is kudu-spark2_2.11:1.10.0.
Created 11-13-2019 06:17 AM
Can you try explicitly casting the string value to a timestamp?
I don't think Spark will push down the timestamp predicate if it's a string. This is tracked in https://issues.apache.org/jira/browse/KUDU-2821.
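For example, something like this (a sketch based on the snippet above; kuduMasters and KuduTable are the variables from the original post, and CAST in the filter string is one way to make the comparison typed as a timestamp):

val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load()
  // Casting the literals lets Spark fold them to timestamp constants,
  // so the kudu-spark connector can push the range predicate down to Kudu
  // instead of scanning the whole table and filtering in Spark.
  .where("time > CAST('2019-10-29 08:05:10' AS TIMESTAMP) AND " +
         "time < CAST('2019-10-29 08:05:30' AS TIMESTAMP)")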
Created 11-13-2019 04:48 PM
Hi @Grant Henke
The timestamp predicate is pushed down after casting the value to a timestamp. Thank you for your help!