
Issue copying data from Kudu to HDFS using Spark SQL

Explorer

I have a Kudu table with the following schema:

create table test_table
(
    `time` timestamp not null,
    `id` string not null,
    .....
    primary key(`time`, `id`)
)
partition by hash(id) partitions 6
stored as kudu;

and I try to use Spark to copy the data to a Parquet table in HDFS:

val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load()
  .where("time > '2019-10-29 08:05:10' AND time < '2019-10-29 08:05:30'")

df.write
  .mode("append")
  .parquet("hdfs://parquet")


But performance is poor, and the job appears to do a full table scan of the Kudu table: in the Spark UI, the "Scan Kudu impala::table" stage shows a row count equal to the entire table.
For comparison, I did the same copy with Impala's INSERT INTO ... SELECT, which is much faster, and there the WHERE predicate does seem to be applied.
Is this full table scan behavior expected, or am I missing something here? The Kudu version is 1.10.0 and the Spark client is kudu-spark2_2.11:1.10.0.
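
For reference, printing the extended plan should show whether the filter stays above the scan (a rough check; the exact plan text varies between Spark versions):

// If the predicate is not pushed down, the time condition shows up as a
// separate Filter operator sitting above the Kudu scan, instead of being
// handled inside the scan itself.
df.explain(true)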

1 ACCEPTED SOLUTION

Cloudera Employee

Can you try explicitly casting the string value to a timestamp?

I don't think Spark will push down the timestamp predicate if it's a string. This is tracked in https://issues.apache.org/jira/browse/KUDU-2821.
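
For example, something along these lines (an untested sketch reusing the variable names from your snippet) should let the connector turn both bounds into Kudu range predicates:

import org.apache.spark.sql.functions.{col, lit}

// Compare the column against actual timestamp values rather than bare
// strings, so kudu-spark can push the bounds down to the Kudu scan.
val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load()
  .where(col("time") > lit("2019-10-29 08:05:10").cast("timestamp") &&
         col("time") < lit("2019-10-29 08:05:30").cast("timestamp"))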



Explorer

Hi @Grant Henke

The timestamp predicate is pushed down now that I cast the string to a timestamp. Thank you for your help!
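
In case it helps anyone else, the working filter looks roughly like this (the SQL-string form of the same cast):

.where("time > cast('2019-10-29 08:05:10' as timestamp)" +
  " AND time < cast('2019-10-29 08:05:30' as timestamp)")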