Member since: 11-29-2016
Posts: 23
Kudos Received: 2
Solutions: 0
08-21-2017
05:35 AM
We are running CDH 5.9.0 (Impala 2.7.0, Hive 1.1.0). We know that when querying TIMESTAMP fields in a Parquet table generated by Hive, Impala may return different results than Hive because of timezone handling. Our Impala startup flags are: convert_legacy_hive_parquet_utc_timestamps=false use_local_tz_for_unix_timestamp_conversions.

What confuses us is this: whether we set hive.parquet.timestamp.skip.conversion to true or false while generating the Parquet tables in Hive, we get the same timestamp results when querying both generated tables with Impala. We expected that different values of hive.parquet.timestamp.skip.conversion would produce different results, but it just doesn't behave that way. We are really confused about this; any reply will be appreciated.

The following steps are a test:

CREATE TABLE test_timestamp (ts TIMESTAMP) STORED AS TEXTFILE;
CREATE TABLE test_ts_skip_conversion_true_parquet (ts TIMESTAMP) STORED AS PARQUET;
CREATE TABLE test_ts_skip_conversion_false_parquet (ts TIMESTAMP) STORED AS PARQUET;

step 1: load data into test_timestamp and query it
step 2: select data into test_ts_skip_conversion_true_parquet (hive.parquet.timestamp.skip.conversion=true)
step 3: select data into test_ts_skip_conversion_false_parquet (hive.parquet.timestamp.skip.conversion=false)
step 4: when querying test_ts_skip_conversion_true_parquet and test_ts_skip_conversion_false_parquet with Impala, we get the same result, but we expect different results here!
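To make the steps concrete, this is roughly what we ran (a sketch, assuming a plain INSERT ... SELECT from the textfile table, with table names as above):

-- hive, step 2: populate the first parquet table with the property enabled
SET hive.parquet.timestamp.skip.conversion=true;
INSERT OVERWRITE TABLE test_ts_skip_conversion_true_parquet SELECT ts FROM test_timestamp;

-- hive, step 3: populate the second parquet table with the property disabled
SET hive.parquet.timestamp.skip.conversion=false;
INSERT OVERWRITE TABLE test_ts_skip_conversion_false_parquet SELECT ts FROM test_timestamp;

-- impala-shell, step 4: both queries return identical timestamps, which is what surprised us
SELECT ts FROM test_ts_skip_conversion_true_parquet;
SELECT ts FROM test_ts_skip_conversion_false_parquet;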
Labels:
- Apache Hive
- Apache Impala
08-16-2017
07:12 PM
Has anyone encountered the same problem?
08-14-2017
07:28 PM
We are running a CDH 5.10 cluster with about 500 nodes. The NN/JN/ZK services are placed as follows:

host1: namenode, journal node, zookeeper
host2: namenode, journal node, zookeeper
host3: journal node, zookeeper

(no other service is installed on these three hosts)

The namenode's fsimage & editlog storage dirs are configured at "/data1/dfs/nn/" and "/data2/dfs/nn/", and the journal node's editlog storage dir is configured at "/data3/dfs/nn/". /data1 and /data2 are mounted on separate disk drives.

When we look into the namenode's log, we find that the namenode takes a long time to flush the editlog to the journal nodes, while flushing to the local disk drives does not take nearly as long. Each value in SyncTimes(ms) corresponds to one edit log output stream (here, the two local directories plus the shared QJM stream); as seen from the log, flushing to the journal nodes is about 4x~5x slower than flushing to local disk (e.g. 14194 ms vs 3543 ms and 3397 ms in the first entry). Below are some log snippets:

--snippet 1
2017-08-15 09:50:12,946 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 326245 Total time for transactions(ms): 3997 Number of transactions batched in Syncs: 284851 Number of syncs: 41371 SyncTimes(ms): 14194 3543 3397
2017-08-15 09:51:12,951 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 533598 Total time for transactions(ms): 6599 Number of transactions batched in Syncs: 471796 Number of syncs: 61757 SyncTimes(ms): 21695 5450 5132
2017-08-15 09:52:12,951 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 692744 Total time for transactions(ms): 11074 Number of transactions batched in Syncs: 610178 Number of syncs: 82561 SyncTimes(ms): 31668 7356 6787
2017-08-15 09:53:12,953 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 843165 Total time for transactions(ms): 14348 Number of transactions batched in Syncs: 742316 Number of syncs: 100838 SyncTimes(ms): 40082 10075 8269
2017-08-15 09:54:15,374 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 920690 Total time for transactions(ms): 15248 Number of transactions batched in Syncs: 808311 Number of syncs: 110467 SyncTimes(ms): 43884 39289 10217
2017-08-15 09:54:30,821 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 920690 Total time for transactions(ms): 15248 Number of transactions batched in Syncs: 810222 Number of syncs: 110468 SyncTimes(ms): 43910 39428 38668
2017-08-15 09:55:30,821 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 206372 Total time for transactions(ms): 2562 Number of transactions batched in Syncs: 160222 Number of syncs: 46144 SyncTimes(ms): 16389 5156 3410

--snippet 2
2017-08-15 09:59:14,716 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 175114 Total time for transactions(ms): 2762 Number of transactions batched in Syncs: 89590 Number of syncs: 85521 SyncTimes(ms): 29138 6897 5872
2017-08-15 10:00:14,716 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 221609 Total time for transactions(ms): 3499 Number of transactions batched in Syncs: 112617 Number of syncs: 108989 SyncTimes(ms): 38074 9056 7451
2017-08-15 10:01:45,172 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 251986 Total time for transactions(ms): 4000 Number of transactions batched in Syncs: 130562 Number of syncs: 121108 SyncTimes(ms): 42776 44831 14277

Is this editlog sync behavior normal? How can we speed up the flush to the journal nodes? Any reply is appreciated.
Labels:
- HDFS
07-09-2017
08:51 PM
No, we are running HiveServer2 with LDAP and Sentry.
07-09-2017
06:39 PM
We are currently running Hive (HiveServer2) with Sentry, and user impersonation is disabled. When any user connects to HiveServer2 and submits queries, HiveServer2 submits all the query jobs to YARN as the same user, hive, not as the actual user who connected to HiveServer2. Is there any way to make HiveServer2 submit jobs as the actual user?
Labels:
- Apache Hive
07-02-2017
07:57 PM
I have re-run the test, and Kudu performs much better this time (though it's still a little slower than Parquet); thanks for @mpercy's suggestion. I changed two things when re-running the test:
1. increased the number of partitions for the fact table from 60 to 768 (affects all queries);
2. changed the query3.sql 'or' predicate into an 'in' predicate, so the predicate can be pushed down to Kudu (only affects query 3).
Below is the re-run result:
(column 'kudu60' is the previous result, i.e. the fact table had 60 partitions)
(column 'kudu768' is the new result, i.e. the fact table has 768 partitions)
06-28-2017
02:44 AM
This is a good suggestion; we are under the scale limits. We may run another test at a later time, e.g. increasing the number of partitions...
06-27-2017
09:05 PM
1. Make sure you run COMPUTE STATS: yes, we do this after loading the data.
2. What is the total size of your data set? The impala tpc-ds tool creates 9 dim tables and 1 fact table. The dim tables are small (record counts from 1k to 4 million+, depending on the data size generated), and the fact table is big. Here is the 'data size <--> record count' of the fact table:
512g <--> 4224587147
256g <--> 2112281549
64g <--> 528071062
3. Can you also share how you partitioned your Kudu table? For the dim tables, we hash partition each one into 2 partitions by its primary key (no partitioning for the parquet tables). For the fact table, we range partition it into 60 partitions by its date field (the parquet fact table is partitioned into 1800+ partitions). For the tables created in Kudu, the replication factor is 3. A DDL sketch of this layout follows below.
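To make the layout concrete, the DDL pattern looks roughly like this (a sketch; the column names are illustrative tpc-ds names, and this uses the newer PARTITION BY syntax — the Impala shipped with our CDH release spelled the same thing DISTRIBUTE BY):

-- dim table: hash partitioned on the primary key into 2 partitions
CREATE TABLE item (
  i_item_sk BIGINT PRIMARY KEY,
  i_item_desc STRING
)
PARTITION BY HASH (i_item_sk) PARTITIONS 2
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '3');

-- fact table: range partitioned on the date field into 60 partitions
CREATE TABLE store_sales (
  ss_sold_date_sk BIGINT,
  ss_item_sk BIGINT,
  ss_sales_price DOUBLE,
  PRIMARY KEY (ss_sold_date_sk, ss_item_sk)
)
PARTITION BY RANGE (ss_sold_date_sk) (
  PARTITION VALUES < 2450900,
  PARTITION 2450900 <= VALUES < 2451000
  -- ...and so on, one range per partition, 60 in total
)
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '3');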
06-26-2017
11:25 PM
1 Kudo
Thanks all for your replies; here are some details about the testing. We are running impalad + kudu on 14 nodes. Node info:

cpu model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu cores: 32
mem: 128G
disk: 4T*12, SAS

impalad and kudu are installed on each node, with 16G of memory for kudu and 96G for impalad. The parquet files are stored on another Hadoop cluster with about 80+ nodes (running HDFS + YARN). We are running the TPC-DS queries from https://github.com/cloudera/impala-tpcds-kit. Each of the 18 queries was run 3 times (3 times on impala+kudu, 3 times on impala+parquet), and we then calculated the average time. Comparing the average query time of each query, we found that kudu is slower than parquet. Here is the result of the 18 queries:

We are planning to set up an OLAP system, so we are comparing impala+kudu vs impala+parquet to see which is the better choice.
06-26-2017
01:00 AM
While doing TPC-DS testing on impala+kudu vs impala+parquet (following https://github.com/cloudera/impala-tpcds-kit), we found that for most of the queries, impala+parquet is 2x~10x faster than impala+kudu. Has anybody done the same testing? PS: We are running kudu 1.3.0 with CDH 5.10.
Labels:
- Apache Impala
- Apache Kudu
06-25-2017
07:12 PM
1 Kudo
Finally I found that an 'or' predicate will not be pushed down to Kudu:

explain select * from student where age=10 or age=20 or age=50 or age=60;
+------------------------------------------------------------------------------------+
| Explain String                                                                      |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=1                                 |
| WARNING: The following tables are missing relevant table and/or column statistics.  |
| preresearch.student                                                                 |
|                                                                                     |
| PLAN-ROOT SINK                                                                      |
| |                                                                                   |
| 01:EXCHANGE [UNPARTITIONED]                                                         |
| |                                                                                   |
| 00:SCAN KUDU [preresearch.student]                                                  |
|    predicates: age = 10 OR age = 20 OR age = 50 OR age = 60                         |
+------------------------------------------------------------------------------------+
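For comparison, the same filter rewritten with IN can be pushed down to the Kudu scan; in the plan it should then appear under 'kudu predicates:' instead of 'predicates:':

explain select * from student where age in (10, 20, 50, 60);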
06-23-2017
02:20 AM
While reading the "Using Impala with Kudu" document, it says: "If the WHERE clause of your query includes comparisons with the operators =, <=, <, >, >=, BETWEEN, or IN, Kudu evaluates the condition directly and only returns the relevant results. This provides optimum performance, because Kudu only returns the relevant results to Impala." But here, with TPC-DS query 3, the BETWEEN predicate is not pushed down to Kudu. Is anything wrong?
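A quick way to check which filters actually reach Kudu is to look at the scan node in EXPLAIN output (a sketch using the student table from my other thread; filters pushed into Kudu show up as 'kudu predicates:', while filters evaluated by Impala stay under 'predicates:'):

explain select * from student where age between 10 and 20;
-- in the output, check the scan node:
--   00:SCAN KUDU [preresearch.student]
--      kudu predicates: ...   (evaluated inside Kudu)
--      predicates: ...        (evaluated by Impala after the scan)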
06-23-2017
02:04 AM
We are running kudu 1.3.0 with CDH 5.10 (the kudu client version is supposed to be 1.2). When running the TPC-DS queries with Impala on Kudu (following https://github.com/cloudera/impala-tpcds-kit), we found that the query 3 BETWEEN predicate is not pushed down to Kudu, which causes Kudu to scan many rows and return them to Impala. Below is what we found in the Impala query profile, together with the relevant tpc-ds q3.sql snippets. Any reply will be appreciated.
Labels:
04-11-2017
06:36 PM
Got it, thanks a lot.
04-11-2017
02:31 AM
I'm new to Kudu. As described in the documentation, Kudu is a column-oriented storage engine, and it supports SQL queries when integrated with Impala. My question is: is the full Impala SQL syntax supported when querying Kudu through Impala? E.g. is SQL-92 fully supported when querying Kudu? Any answer will be appreciated. A sketch of the kind of statements I mean follows below.
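For example, I would like to know whether statements like these work on a Kudu table (an illustrative sketch; 'students' is a hypothetical table, and my understanding is that Impala allows row-level UPDATE/DELETE only on Kudu tables):

-- ordinary read queries
SELECT grade, COUNT(*) FROM students GROUP BY grade;

-- row-level DML, which relies on Kudu storage
UPDATE students SET grade = 3 WHERE id = 42;
DELETE FROM students WHERE id = 7;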
Labels:
- Apache Impala
- Apache Kudu
02-06-2017
10:22 PM
Got it, thanks a lot.
02-06-2017
12:35 AM
"This allows for better recovery as then it fails to commit it still has the edit in the local directory..." In the fails to commit case, is the failed commit edit exist in memory(double buffer) or not? If it's still there, then QJM can try to commit again using the edits in memory. Thanks for your reply, and look for you futher replay.
02-03-2017
02:04 AM
With NameNode HA configured to share edits via QJM, the active NN writes edits to both QJM and a local directory. When the NN starts up, it reads the fsimage from the local directory and the edits from QJM. Since the NN does not read edits from the local directory, why does it still write edits to a local directory?
Labels:
- HDFS