About lewiss

lewiss · ‎08-21-2017

we are running cdh 5.9.0(impala 2.7.0, hive 1.1.0). we know that, while quering timestamp fields(parquet table generated by hive) with impala, we may get different result(vs hive) because of the timezone reason. the startup args of our impala is: convert_legacy_hive_parquet_utc_timestamps=false use_local_tz_for_unix_timestamp_conversions. what we confuse is that, whatever true/false we set hive.parquet.timestamp.skip.conversion while generating different parquet table in hive, we got the same timestamp result while doing query with impala from both of the generated table. what we expect is that, when the value of hive.parquet.timestamp.skip.conversion is different, the result should be different. but it just don't perform in this way. we are really confuse about this, any reply will be appreciate. bellowing steps is a test: CREATE TABLE test_timestamp (ts TIMESTAMP) STORED AS TEXTFILE; CREATE TABLE test_ts_skip_conversion_true_parquet (ts TIMESTAMP) STORED AS TEXTFILE; CREATE TABLE test_ts_skip_conversion_false_parquet (ts TIMESTAMP) STORED AS TEXTFILE; step1, load data into and query from test_timestamp step2, select data into test_ts_skip_conversion_true_parquet (hive.parquet.timestamp.skip.conversion=true) step3, select data into test_ts_skip_conversion_false_parquet (hive.parquet.timestamp.skip.conversion=false) step4, while query test_ts_skip_conversion_true_parquet and test_ts_skip_conversion_false_parquet with impala, we got the same result, but we expect different result here!

lewiss · ‎07-02-2017

I have been re-run the test, and kudu perform much better this time(though it's still a little bit slower than parquet), thanks for @mpercy's suggestion. I changed two things by re-runing the test: 1, increase the partitions for the fact table from 60 to 768(affact all queries) 2, change the query3.sql 'or' predicate into 'in' predicate, so predicate can push down to kudu(only affact query 3) below is the re-run result: (column 'kudu60' is the previous result, which means the partitions of fact table is 60 ) (column 'kudu768' is the new result, which means the partitions of fact table is 768)

lewiss · ‎06-28-2017

This is a good suggestion, we are under the scale limits. We may run another test in a later time, e.g. increasing # of partitions...

lewiss · ‎06-27-2017

1, Make sure you run COMPUTE STATS: yes, we do this after loading data 2, What is the total size of your data set? impala tpc-ds tool create 9 dim tables and 1 fact table, which dim tables are small(record num from 1k to 4million+ according to the datasize generated), and the fact table is big, here is the 'data siez-->record num' of fact table: 512g<-->4224587147 256g<-->2112281549 64g<-->528071062 3, Can you also share how you partitioned your Kudu table? for the dim tables, we hash partition it into 2 partitions by their primary (no partition for parquet table), for the fact table, we range partition it into 60 partitions by its 'data field'(parquet partition into 1800+ partitions), for those tables create in kudu, their replication factor is 3.

lewiss · ‎06-26-2017

Thanks all for your reply, here is some detail about the testing. We are running impalad+kudu on 14 nodes, nodes info: cpu model : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz cpu cores: 32 mem: 128G disk: 4T*12, sas impalad and kudu are installed on each node, with 16G MEM for kudu, and 96G MEM for impalad. parquet files are stored on another hadoop cluster with about 80+ nodes(running hdfs+yarn). We are running tpc-ds queries(https://github.com/cloudera/impala-tpcds-kit) . With the 18 queries, each query were run with 3 times, （3 times on impala+kudu, 3 times on impala+parquet）and then we caculate the average time. While compare to the average query time of each query,we found that kudu is slower than parquet. Here is the result of the 18 queries: We are planing to setup an olap system, so we compare impala+kudu vs impala+parquet to see which is the good choice.

lewiss · ‎06-26-2017

While we doing tpc-ds testing on impala+kudu vs impala+parquet(according to https://github.com/cloudera/impala-tpcds-kit), we found that for most of the queries, impala+parquet is 2times~10times faster than impala+kudu. Is any body ever did the same testing? ps:We are running kudu 1.3.0 with cdh 5.10.

lewiss · ‎04-11-2017

Got it, thanks a lot.

lewiss · ‎04-11-2017

I'm new to kudu. As describe in document, kudu is a column oriented storage engine, and it support sql query when integrated with impala. My question is that: is impala sql syntax fully support when querying kudu via impala sql? e.g. is SQL-92 fully support while querying kudu ? Any answer will be appreciated.

Online	Offline
Last Visited	‎11-27-2017 01:59 AM

Member Since	‎11-29-2016 01:41 AM
Last Visited	‎11-27-2017 01:59 AM
Posts	23
Kudos received	2

Cloudera Community

wharever value hive.parquet.timestamp.skip.convers...

Re: kudu is slower than parquet?

Re: kudu is slower than parquet?

Re: kudu is slower than parquet?

Re: kudu is slower than parquet?

kudu is slower than parquet?

Re: Is impala sql syntax fully support when queryi...

Is impala sql syntax fully support when querying k...