04-03-2019 12:20 AM - last edited on 04-04-2019 06:34 AM by cjervis
Hi, we use Impala with Parquet and are considering switching to Kudu to get the benefit of a mutable persistence layer.
We currently write the Parquet files manually in Java code and upload them to HDFS.
For Kudu, I tested switching this to the KuduClient for inserts, but if I insert a timestamp with "new Date()" and then SELECT the inserted row in Impala, the timestamp is one hour off, so I assume it's a time zone issue.
When I test the same thing by executing an INSERT INTO statement via Impala JDBC, the result is perfectly fine.
However, there seem to be serious performance issues when using batch inserts via JDBC (70 rows per second) compared with using the KuduClient directly (>1000 rows per second). That's why I would prefer the Kudu client.
Can someone give me a hint on how to solve the timestamp issue? Should that be done in code, or can I somehow set up Kudu and Impala to use the same time zone?
I already checked the server dates and they are fine.
I don't know if it helps, but we also have Oozie running on this cluster, and there we see the same issue when scheduling workflows, which in that case is not a big deal.
Thanks a lot in advance for your help.
04-03-2019 01:11 PM
When you said you were using 'new Date()', did you mean java.util.Date? If so, can you try 'SimpleDateFormat', where you can set the time zone with 'setTimeZone'? An example can be found here.
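As a minimal sketch of the 'SimpleDateFormat' suggestion: a java.util.Date holds an instant without any time zone, and the zone only comes into play when the value is formatted. The class name, the helper 'formatIn' and the zone 'Europe/Berlin' are made up for illustration and are not from the original post.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateZoneDemo {
    // Hypothetical helper: renders the same instant as wall-clock time in the given zone.
    static String formatIn(Date date, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        Date now = new Date();
        // Same instant, two renderings; the zone ID here is only an example.
        System.out.println("UTC:    " + formatIn(now, "UTC"));
        System.out.println("Berlin: " + formatIn(now, "Europe/Berlin"));
    }
}
```

If the two printed values differ by the same amount as the offset seen in Impala, that would point to a display/time zone difference rather than a wrong clock.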
I also noticed that the Impala docs state: 'The conversion between the Impala 96-bit representation and the Kudu 64-bit representation introduces some performance overhead when reading or writing TIMESTAMP columns. You can minimize the overhead during writes by performing inserts through the Kudu API. Because the overhead during reads applies to each query, you might continue to use a BIGINT column to represent date/time values in performance-critical applications.' So I guess that is part of why you are seeing the difference in insert performance. If you care about query performance as well, the other option is to use a 'BIGINT' column.
04-03-2019 01:22 PM - edited 04-03-2019 01:29 PM
I suspect your timestamps are off by one hour because Impala treats timestamps stored in Kudu as UTC: the value is held in Impala as a 96-bit int with nanosecond precision and converted to/from unix time, which Kudu stores as a 64-bit int with microsecond precision. So you should be able to solve the issue by treating all timestamps in your application as UTC. The discussion in this thread may be useful: https://lists.apache.org/thread.html/bb4ef37c88e76959399f40c7053a76b644217e76664982a60c703c7e@%3Cuse...
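To illustrate what "treating timestamps as UTC" can mean in application code, here is a minimal sketch using java.time. It assumes that Impala displays the stored value as UTC and that you want the query result to show the local wall-clock time; under that assumption the instant is shifted before writing. The class, the method 'wallClockAsUtc' and the zone 'Europe/Berlin' are hypothetical. Note that this changes the stored instant, so the alternative is to store the true instant and convert at read time instead.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class UtcShiftDemo {
    // Hypothetical helper: takes the wall-clock reading of 'instant' in 'zone'
    // and reinterprets that reading as if it were a UTC time.
    static Instant wallClockAsUtc(Instant instant, ZoneId zone) {
        return instant.atZone(zone).toLocalDateTime().toInstant(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        ZoneId zone = ZoneId.of("Europe/Berlin"); // example zone, an assumption
        System.out.println("true instant (UTC):   " + now);
        System.out.println("wall clock as if UTC: " + wallClockAsUtc(now, zone));
    }
}
```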
For performance, I'm interested in more details: if you're doing something like 'insert into <kudu_table> values (...)' to insert a few rows at a time through Impala, then you'll definitely get better performance by going through the Kudu API, because going through Impala you pay extra cost for query parsing, planning, etc. Impala is better suited to things like 'insert into <kudu_table> select * from <some_hdfs_table>'.
And as Hao pointed out, there is overhead going through Impala because of the conversion from 96-bit UTC to 64-bit unix time, so you may want to make the Impala type a bigint and only convert to/from timestamps when necessary.
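If you go the bigint route, the application has to do the conversion itself. A minimal sketch, assuming the column holds microseconds since the unix epoch (the same unit Kudu uses for its timestamp type); the class and method names are made up for illustration:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class EpochMicrosDemo {
    // Instant -> microseconds since 1970-01-01T00:00:00Z (sub-microsecond part is truncated).
    static long toEpochMicros(Instant instant) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, instant);
    }

    // Microseconds since the epoch -> Instant, for when a real timestamp is needed again.
    static Instant fromEpochMicros(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long micros = toEpochMicros(now);
        System.out.println(micros + " -> " + fromEpochMicros(micros));
    }
}
```

Because the stored number is zone-free, any time zone handling then happens only where the value is displayed.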
04-05-2019 03:24 AM
Thanks a lot for your fast answers. I will try converting the timestamps to UTC in code and update here as soon as I have results.
In the meantime I want to provide more info about the slow inserts with Impala JDBC:
With the Java Kudu client, I used the default flush mode.
With Impala JDBC, I tried various things. First, single insert statements, which were bound to be slow given the explanation above, and which is also what I expected.
Second, batch insert statements with different batch sizes. I had assumed and hoped that the query planning overhead would be amortized over the batch, but unfortunately I could not observe that: I ended up with very similar rows/s throughput regardless of whether the batch size was 1K, 5K, 10K or 50K rows.
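One way to read that observation is with a simple cost model: time per batch = fixed overhead + batch size × per-row cost. The sketch below is only a back-of-the-envelope model, not a measurement. The 0.5 s fixed overhead is an invented number, and the per-row cost of 1/70 s is simply derived from the roughly 70 rows per second reported above.

```java
public class BatchThroughputModel {
    // Rows per second under the model: batchSize / (fixedOverhead + batchSize * perRowCost).
    static double rowsPerSecond(int batchSize, double fixedOverheadSec, double perRowSec) {
        return batchSize / (fixedOverheadSec + batchSize * perRowSec);
    }

    public static void main(String[] args) {
        double fixedOverheadSec = 0.5;  // hypothetical planning/parsing cost per statement
        double perRowSec = 1.0 / 70.0;  // assumed from the ~70 rows/s seen via JDBC
        for (int batch : new int[] {1_000, 5_000, 10_000, 50_000}) {
            System.out.printf("batch %d: %.1f rows/s%n",
                    batch, rowsPerSecond(batch, fixedOverheadSec, perRowSec));
        }
    }
}
```

In this model the throughput approaches 1 / perRowSec as the batch grows, so a flat rows/s figure across batch sizes would suggest that a per-row cost dominates rather than the one-time planning overhead.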
I have to admit, though, that my test was biased, since I would really prefer to use the Kudu client directly. I haven't tried that many things with Impala JDBC because my main goal was to solve the time zone issue.
Also, the environment in which I'm running these tests is very light on resources, but since I let the different approaches compete in the same environment, I think it still gives me a good idea.
04-05-2019 09:59 AM - edited 04-05-2019 10:00 AM