
Issue copying data from Kudu to HDFS using Spark SQL

Explorer

I have a Kudu table with the following schema:

create table test_table
(
    `time` timestamp not null,
    `id` string not null,
    .....
    primary key (`time`, `id`)
)
partition by hash (id) partitions 6
stored as kudu;

and I try to use Spark to copy the data to a Parquet table on HDFS:

val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load
  .where("time > '2019-10-29 08:05:10' AND time < '2019-10-29 08:05:30'")

df.write
  .mode("append")
  .parquet("hdfs://parquet")

But the performance is poor, and the job appears to do a full table scan of the Kudu table: in the Spark UI, the row count shown for "Scan Kudu impala::table" equals the total number of rows in the table.
For comparison, I copied the data with an Impala "insert into ... select" statement (sketched below), which is much faster, and there the "where" predicate appears to be applied.
Is this full-table-scan behavior expected, or am I missing something? The Kudu version is 1.10.0 and the Spark client is kudu-spark2_2.11:1.10.0.
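
For reference, the Impala copy was along these lines (a rough sketch; the target table name parquet_table is a placeholder, only test_table comes from the post above):

insert into parquet_table
select * from test_table
where `time` > '2019-10-29 08:05:10'
  and `time` < '2019-10-29 08:05:30';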

Explorer

Hi @Grant Henke

The timestamp predicate works after I cast it to a timestamp. Thank you for your help!
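
For anyone landing here later, a minimal sketch of that fix, assuming the cast is written inline in the where expression (the exact syntax below is an illustration; kuduMasters and KuduTable are the variables from the original snippet):

// Casting the string literals to timestamps lets Spark push the range
// predicate down to the Kudu scan instead of reading the whole table
// and filtering afterwards.
val df = spark.read
  .options(Map("kudu.master" -> kuduMasters, "kudu.table" -> KuduTable))
  .format("kudu")
  .load
  .where("time > cast('2019-10-29 08:05:10' as timestamp) AND " +
         "time < cast('2019-10-29 08:05:30' as timestamp)")

// The pushed predicates can be checked in the physical plan:
df.explain()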