Issue copying data from Kudu to HDFS using Spark SQL

Explorer

I have a Kudu table with this schema:

create table test_table
(
    `time` timestamp not null,
    `id` string not null,
    .....
    primary key(`time`, `id`)
)
partition by hash(id) partitions 6
stored as kudu;

and I am trying to use Spark to copy the data to a Parquet table in HDFS:

val df = spark.read
    .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
    .format("kudu")
    .load()
    .where("time > '2019-10-29 08:05:10' AND time < '2019-10-29 08:05:30'")

df.write
    .mode("append")
    .parquet("hdfs://parquet")

But the performance is low and the job seems to do a full table scan of the Kudu table (in the Spark UI, the row count shown for the "Scan Kudu impala::table" stage matches the size of the entire table).
For comparison, I did the same copy with Impala's INSERT INTO ... SELECT, which is much faster and where the WHERE predicate does appear to be applied.
Is this full-table-scan behavior expected, or am I missing something here? The Kudu version is 1.10.0 and the Spark client is kudu-spark2_2.11:1.10.0.
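
One way to check whether the filter actually reaches Kudu (a sketch using the df from the snippet above, not from the original post) is to print the physical plan; when pushdown works, the Kudu scan node lists the time-range predicates instead of a bare full-table relation:

// Sketch: inspect the physical plan for pushed-down Kudu predicates.
df.explain(true)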

1 ACCEPTED SOLUTION

Cloudera Employee

Can you try explicitly casting the string value to a timestamp? I don't think Spark will push down the timestamp predicate if it's a string. This is tracked in https://issues.apache.org/jira/browse/KUDU-2821.
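
For example, something like the following (a sketch based on the code in the question, reusing its variable names) should let kudu-spark push the range predicate down:

// Sketch of the suggested fix: cast the string literals to timestamps so the
// time-range filter is pushed down to Kudu instead of being applied after a full scan.
val df = spark.read
    .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
    .format("kudu")
    .load()
    .where("time > CAST('2019-10-29 08:05:10' AS TIMESTAMP)" +
           " AND time < CAST('2019-10-29 08:05:30' AS TIMESTAMP)")

The equivalent Column-API form would use lit("2019-10-29 08:05:10").cast("timestamp") inside the comparison.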

2 REPLIES

Explorer

Hi @Grant Henke 

The timestamp predicate works after I cast it to a timestamp. Thank you for your help!