Member since: 04-02-2019
Posts: 3
Kudos Received: 0
Solutions: 0
04-05-2019 06:05 AM
Hi, thanks a lot for the hint with SimpleDateFormat. Although it feels a bit weird, it works 🙂
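The exact code behind the SimpleDateFormat hint is not shown in this post. A minimal sketch of one possible variant, assuming the goal is to shift a `Date` so that its UTC rendering shows the local wall-clock time (the class and method names are made up for illustration):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class WallClockShift {
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";

    /**
     * Formats the instant as wall-clock text in the given zone, then parses
     * that text as if it were UTC. The returned Date is shifted by the zone's
     * offset, so it no longer denotes the original instant.
     */
    static Date shiftToWallClock(Date instant, TimeZone zone) {
        // SimpleDateFormat is not thread-safe, so create instances per call.
        SimpleDateFormat local = new SimpleDateFormat(PATTERN);
        local.setTimeZone(zone);
        SimpleDateFormat utc = new SimpleDateFormat(PATTERN);
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return utc.parse(local.format(instant));
        } catch (ParseException e) {
            // Should not happen: the text was produced with the same pattern.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `WallClockShift.shiftToWallClock(new Date(), TimeZone.getDefault())` would give a value that reads as local time when displayed as UTC. The round trip through a string is probably what makes this feel weird, and the shifted value should only be used for storage, not for further date arithmetic.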
04-05-2019 03:24 AM
Hi, thanks a lot for your fast answers. I will try the timezone conversion to UTC in code and update here as soon as I have results.
In the meantime, I want to provide more info about the slow inserts with Impala JDBC. With the Java Kudu client, I used the default flush mode. With Impala JDBC, I tried out various things. First, single insert statements, which were bound to be slow given the explanation above and which is also what I expected. Second, batch insert statements with different batch sizes. I assumed and hoped the overhead of query planning would be amortized over the batch, but unfortunately I could not observe such behaviour: I ended up with a very similar rows/s throughput whether the batch size was 1K, 5K, 10K, or 50K rows.
I have to admit that my test was quite opinionated, since I would really prefer to use the Kudu client directly. Hence I haven't tried that many things with Impala JDBC, since my main goal was to overcome the timezone issue. Also, the environment in which I'm running these tests is very light on resources, but since I let the different approaches compete in the same environment, I think it still gives me a good idea.
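For reference, a minimal sketch of what a JDBC batch insert with a configurable batch size might look like, using only the standard `java.sql` API. The table name `events`, its columns, and the class and method names are hypothetical and not taken from the test described above. Whether a given JDBC driver sends a batch as one statement or executes it row by row is driver-dependent and not established here.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

public class JdbcBatchInsert {
    // Hypothetical table and columns, for illustration only.
    static final String SQL = "INSERT INTO events (id, created_at) VALUES (?, ?)";

    /** Number of executeBatch() calls needed to send rowCount rows. */
    static int batchCount(int rowCount, int batchSize) {
        return (rowCount + batchSize - 1) / batchSize;
    }

    /** Queues rows with addBatch() and flushes every batchSize rows. */
    static void insertInBatches(Connection conn, List<Timestamp> rows, int batchSize)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            int pending = 0;
            long id = 0;
            for (Timestamp ts : rows) {
                ps.setLong(1, id++);
                ps.setTimestamp(2, ts);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the final partial batch
            }
        }
    }
}
```

With 50K rows, a batch size of 5K gives `batchCount(50000, 5000) == 10` flushes. If throughput stays flat across batch sizes, as observed above, one thing worth checking is how the driver actually executes the batch.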
04-03-2019 12:20 AM
Hi, we use Impala with Parquet and are considering switching to Kudu to get the benefit of a mutable persistence layer.
We currently write the Parquet files manually in Java code and upload them to HDFS.
For Kudu, I tested switching to the KuduClient for inserts, but if I insert a timestamp created with "new Date()" and then SELECT the inserted row in Impala, the timestamp is one hour off, so I assume it's a timezone issue.
Testing the same by executing an INSERT INTO statement via Impala JDBC, the result is perfectly fine.
However, there seem to be serious performance issues when using batch inserts via JDBC (70 rows per second) versus using the KuduClient directly (>1000 rows per second). That's why I would prefer the Kudu client.
Can someone give me a hint on how to solve the timestamp issue? Should that be done in code, or can I somehow set up Kudu and Impala to use the same timezone?
I already checked the server dates and they are fine.
I don't know if it helps, but we also have Oozie running on this cluster, and we see the same issue there when scheduling workflows, although in that case it is not a big problem.
Thanks a lot in advance for your help.
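A minimal sketch of how a one-hour offset like the one described above can arise, assuming the stored value is the UTC instant while the expected value is the local wall-clock time in a UTC+1 zone. The offset and the class and method names are assumptions for illustration; the actual cluster timezone is not stated in this post. Kudu's timestamp column type holds microseconds since the Unix epoch, so the sketch also shows that conversion.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampOffsetDemo {
    /** Wall-clock time of an instant as seen in the given zone. */
    static LocalDateTime wallClock(Instant instant, ZoneId zone) {
        return LocalDateTime.ofInstant(instant, zone);
    }

    /** Microseconds since the Unix epoch (UTC) for the given instant. */
    static long epochMicros(Instant instant) {
        long wholeSeconds = Math.multiplyExact(instant.getEpochSecond(), 1_000_000L);
        return Math.addExact(wholeSeconds, instant.getNano() / 1_000L);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Same instant, two wall-clock readings one hour apart.
        System.out.println("UTC:   " + wallClock(now, ZoneOffset.UTC));
        System.out.println("UTC+1: " + wallClock(now, ZoneOffset.ofHours(1))); // assumed local offset
        System.out.println("micros since epoch: " + epochMicros(now));
    }
}
```

The instant itself is the same in both cases. Only the rendering differs, which would be consistent with correct server clocks and a value that still looks one hour off.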
Labels:
- Apache Impala
- Apache Kudu
- HDFS