Created 11-12-2019 06:39 PM
I have a Kudu table with the following schema:
create table test_table (
  `time` timestamp not null,
  `id` string not null,
  .....
  primary key (`time`, `id`)
)
partition by hash(id) partitions 6
stored as kudu;
I am trying to use Spark to copy the data to a Parquet table in HDFS:
val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load()
  .where("time > '2019-10-29 08:05:10' AND time < '2019-10-29 08:05:30'")

df.write
  .mode("append")
  .parquet("hdfs://parquet")
But performance is poor, and the job appears to do a full table scan against the Kudu table: in the Spark UI, the row count shown for the "Scan Kudu impala::table" stage equals the size of the entire table.
For comparison, I did the same copy with Impala's INSERT INTO ... SELECT, which is much faster, and there the WHERE predicate appears to be applied.
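The Impala statement was roughly the following (a sketch; parquet_table is a placeholder for the actual destination table):

-- parquet_table is a placeholder destination table name
INSERT INTO parquet_table
SELECT * FROM test_table
WHERE `time` > '2019-10-29 08:05:10' AND `time` < '2019-10-29 08:05:30';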
Is this full table scan behavior expected, or am I missing something here? The Kudu version is 1.10.0 and the Spark client is kudu-spark2_2.11:1.10.0.
Created 11-13-2019 06:17 AM
Can you try explicitly casting the string value to a timestamp?
I don't think Spark will push down the timestamp predicate if it's a string. This is tracked in https://issues.apache.org/jira/browse/KUDU-2821.
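For example, something like this (a sketch based on the snippet above; kuduMasters and KuduTable are the variables from the original post, and CAST in the filter string is one way to make the comparison typed as a timestamp):

val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load()
  // Casting the literals lets Spark fold them to timestamp constants,
  // so the kudu-spark connector can push the range predicate down to Kudu
  // instead of scanning the whole table and filtering in Spark.
  .where("time > CAST('2019-10-29 08:05:10' AS TIMESTAMP) AND " +
         "time < CAST('2019-10-29 08:05:30' AS TIMESTAMP)")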
Created 11-13-2019 04:48 PM
Hi @Grant Henke
The timestamp predicate is pushed down after casting the value to a timestamp. Thank you for your help!