Member since: 05-03-2018
25 Posts
0 Kudos Received
0 Solutions
04-24-2019 08:51 AM
Could you advise whether there is a solution to the problem of Impala assigning heavy query fragments to busy executors? For example, we hit the following on CDH 5.16 with Impala 2.12.0:

Impala has several (let's say 5) executors, each with ~100 GB RAM, and admission control is enabled. The impalad mem_limit is left at the default (or about the default, ~80%), i.e. 80 GB. A relatively long and heavy query (let's call it Query1) arrives, and one of its steps takes ~70 GB RAM on executor1, so only ~10 GB remains available for reservation on that executor, while the other 4 executors are nearly idle. At the same time a second query (Query2) arrives which requires 40 GB RAM, and it may happen that Query2 is assigned to the busy executor1. As a result, Query2 fails because it cannot allocate/reserve the memory.

Is there a way to configure Impala to assign fragments/query parts to less busy executors? So far, reducing concurrency or removing the reservations (since the reserved memory is usually larger than what is actually used) might work, but effectively using only 1-2 executors out of 5 looks too inefficient to me. Impala on YARN might potentially help, but as far as I can see it requires Llama, which is deprecated and is going to be removed soon.
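As a side note, before Query2 is even submitted you can see which executor is holding the ~70 GB via the impalad debug web UI; the hostnames below are placeholders for your executor1 and the coordinator:

# per-executor memory breakdown (impalad debug web UI, default port 25000)
curl -s http://executor1:25000/memz
# backends (executors) known to this daemon
curl -s http://coordinator-host:25000/backends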
04-05-2019 03:35 AM
Hi, I'm setting up Impala Admission Control. For the user.<username> placement rules there is a "Not recommended" remark: "Use the resource pool that matches the username. (Not Recommended)" (https://www.cloudera.com/documentation/enterprise/5-16-x/topics/cm_mc_resource_pools.html). In my use case, specific limits should be applied to a set (about a dozen) of users. At the same time, group management is relatively hard here, so I would prefer the root.[username] approach. I would like to better understand the drawbacks of this approach: are there any technical limitations from Impala's point of view, or is it just bad practice because it is harder to maintain, support and manage (from an administrator's perspective)?
Labels: Apache Impala
04-05-2019 02:07 AM
Yes, that is for limiting the query in order to reduce its accidental influence on other users (i.e. from occupying all available resources). One more point: Impala may have a default query memory limit set, so you may wish to override it.
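A minimal impala-shell sketch of overriding it for a single session; the 10g value and the table name are placeholders, not recommendations:

[node009:21000] > -- cap all queries in this session at 10 GB
[node009:21000] > set mem_limit=10g;
[node009:21000] > select count(*) from some_db.some_table;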
09-20-2018 02:31 AM
mpercy, reviewing the cluster size is in progress, but it takes a long time, as it usually does with on-premises hardware 🙂 However, the tablet count will most likely remain a bottleneck in the future too, so any related performance improvement is worthwhile. I see there are optimizations for deletions in 1.7.1, so yes, that might be worth considering as well. Raft consensus timeouts probably may help to avoid avalanches, but that sounds more like fighting the consequences than the cause.

Alexey1c,
1. /etc/hosts is in place again since the issue appeared; it is the first point checked in nsswitch.conf.
2. THP was disabled from the start, per the Cloudera recommendations.
3. Tablets were rebalanced. Thanks for the link, time to replace the custom script.

Basically, reducing the number of tablets (still above the recommendations, though), rebalancing, and populating /etc/hosts were the first action points, and they reduced the occurrence significantly. But "slow DNS lookup" and "couldn't get scanned data" still appear from time to time.
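For the lingering "slow DNS lookup" warnings, a quick shell check that resolution is really served locally from /etc/hosts (node07 and 10.250.250.19 are taken from the host list in the original post):

# confirm the glibc lookup order ("files" should come before "dns")
grep '^hosts' /etc/nsswitch.conf
# forward and reverse lookups through NSS; both should return instantly
time getent hosts node07
time getent hosts 10.250.250.19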
09-18-2018 03:08 AM
Yes, I understand that. Unfortunately, the use case dictates conditions under which we hit the limits:
1) a small number of large servers -> not that many tablets available in total;
2) several dozen systems * a hundred tables each * 3-50 tablets per table * the replication factor -> quite a large number of tablets required.

Could parameter tuning improve the situation with the backpressure that appears, e.g. the default maintenance_manager_num_threads = 1? Does it make sense to change it to 2-4? Any other advice?
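For reference, a sketch of the change being discussed; the value is illustrative, and on CDH it would normally go through the tablet server's gflagfile safety valve rather than being edited by hand:

# kudu-tserver gflagfile addition (requires a tserver restart)
--maintenance_manager_num_threads=4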
09-03-2018 03:26 AM
We've reduced the number of tablets to 4-4.5k per tserver and populated /etc/hosts; the frequency has dropped significantly (previously it sometimes happened for 3 map attempts in a row and failed the job, now it happens rarely and is handled by the second attempt). The application writes asynchronously, but it shouldn't wait that long; I guess at the OS level it may still be interrupted. I haven't seen the scanner ID in those tservers' logs. However, there were cases some time ago when the "scanner ID not found" error appeared right after scanner creation, and roughly 60 seconds later the scanner timeout message appeared on one of the tservers.

Regarding the cluster check: during the backpressure issue the tservers may become unavailable in Kudu (including in "cluster ksck") and consensus is lost. After some time Kudu returns to its normal operational state.

Are there any recommendations to reduce backpressure? Is it worth increasing the queue from its 50 items to something larger? Or any recommendations for tuning Kudu for a larger tablet count?
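If the "50 items" refers to the RPC service queue, the corresponding upstream gflags look roughly like this; the values are illustrative, not a tested recommendation:

# kudu-tserver gflagfile sketch: widen the RPC service queue and add
# service worker threads so bursts are absorbed instead of rejected
--rpc_service_queue_length=200
--rpc_num_service_threads=30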
08-27-2018 08:54 AM
Hi,

We're facing instability with Kudu. We run MapReduce jobs where mappers read from Kudu, process the data, and pass it to reducers, which write back to Kudu. Sometimes mappers fail with "Exception running child : java.io.IOException: Couldn't get scan data", caused by "<tablet_id> pretends to not know KuduScanner" (see mapper.txt in the links below). This happens across multiple attempts as well, which results in job failure.

The environment is:
CDH 5.15.0
Kudu 1.7
3 masters
15 tservers

Here is a failure example, which happened at 2018-08-27 10:26:41. This time there was also a restart of one of the tservers. At that time the Kudu tablet servers showed multiple requests with backpressure and consensus loss (see the attached files from the 3 nodes where the replicas were placed). The logs on the other tablets were removed; the attached logs cover a few minutes before and after the failure.

Mapper error - https://expirebox.com/download/5cce0d1c712565547c2f382aab99a630.html
node07 - https://expirebox.com/download/9be42eeb88a367639e207d0c148e6e09.html
node12 - https://expirebox.com/download/0e021bd7fd929b9bd585e4e995729994.html
node13 - https://expirebox.com/download/db31c5ac0305f18b6ef0e2171e2d034c.html
Kudu leader at that time - https://expirebox.com/download/f24cd185e2bb4889dbc18b87c70fc4c8.html

The documented limitations are respected, except for tablets per server: currently a few tservers have ~5000 tablets, the others fewer. The servers are powerful enough and have reserve capacity, so looking at the metrics there are no anomalies or peaks in CPU, RAM, disk I/O, or network.

Side note: from time to time "slow DNS" messages appear, where "real" may exceed the limit (5 s) while "user" and "system" are fine. Some time ago we tried resolving DNS locally, but without a positive effect. Still, I don't expect this to be the root cause.

Any suggestions on how to tune the configuration are welcome as well; a scanner-TTL flag sketch is included after the host list below.

IP - Hostname - Role - tserver ID
10.250.250.11 - nodemaster01 - Kudu Master001
10.250.250.12 - nodemaster02 - Kudu Master002
10.250.250.13 - node01 - Kudu Master003 + Kudu Tablet Server
10.250.250.14 - node02 - Kudu Tablet Server
...
10.250.250.19 - node07 - Kudu Tablet Server - e8a1d4cacb7f4f219fd250704004d258
...
10.250.250.24 - node12 - Kudu Tablet Server - 3a5ee46ab1284f4e9d4cdfe5d0b7f7fa
10.250.250.25 - node13 - Kudu Tablet Server - 9c9e4296811a47e4b1709d93772ae20b
...
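The "pretends to not know KuduScanner" errors line up with Kudu's scanner keep-alive window; a gflag sketch, assuming the upstream default of 60 seconds is in effect on these tservers (the value below is only illustrative):

# kudu-tserver gflagfile sketch: keep idle scanners alive longer so a
# stalled mapper attempt can resume its scan instead of failing with
# "Couldn't get scan data"
--scanner_ttl_ms=180000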
Labels: Apache Kudu
07-19-2018 11:56 PM
Oh, it turns out the issue affects functions as well. Thanks for the fast reply and for raising a ticket.
07-19-2018 08:07 AM
Hi,

We're struggling with the issue that Impala does not provide access to the SHOW CREATE VIEW statement for the owner of the view (who is also the owner of the underlying table). Sentry-based authorization is used. The documentation (https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_show.html#show_create_view) states that the required privileges are: VIEW_METADATA privilege on the view and SELECT privilege on all underlying views and tables. In our case the user owns both the view and the table, so I expect both requirements are fulfilled. As you can see in the log below, the user has created, selected and dropped the view, but he couldn't see the CREATE statement. Invalidate metadata was tried too. Could you kindly help to resolve the issue, so that developers can check the CREATE statements: is there a missing bit, or is it a bug?

Environment:
CDH 5.14.2
Impala 2.11.0
LDAP authentication
Sentry file authorization

Here is the log from different aspects:

=== Sentry file ========
[users]
svc.analyticaldata_dq=analytical_data, ...
...
[groups]
analytical_data=analytical_data
...
[roles]
analytical_data=server=server1->db=analytical_data
...

=== Impala CLI =============
[node009:21000] > select version();
Query: select version()
+--------------------------------------------------------------------------------------------+
| version()                                                                                   |
+--------------------------------------------------------------------------------------------+
| impalad version 2.11.0-cdh5.14.2 RELEASE (build ed85dce709da9557aeb28be89e8044947708876c)   |
| Built on Tue Mar 27 13:39:48 PDT 2018                                                       |
+--------------------------------------------------------------------------------------------+
[node009:21000] > select user();
Query: select user()
Query submitted at: 2018-07-19 15:30:16 (Coordinator: http://node009:25000)
Query progress can be monitored at: http://node009:25000/query_plan?query_id=1e4cc7a8258b79ff:e58adb9100000000
+-----------------------+
| user()                |
+-----------------------+
| svc.analyticaldata_dq |
+-----------------------+
Fetched 1 row(s) in 0.08s
[node009:21000] > use analytical_data;
Query: use analytical_data
[node009:21000] > create view t as select count(*) from system9999.cases;
Query: create view t as select count(*) from system9999.cases
Query submitted at: 2018-07-19 15:24:52 (Coordinator: http://node009:25000)
Query progress can be monitored at: http://node009:25000/query_plan?query_id=304454e5a834396a:c1fbf50a00000000
Fetched 0 row(s) in 0.08s
[node009:21000] > select * from t;
Query: select * from t
Query submitted at: 2018-07-19 15:24:55 (Coordinator: http://node009:25000)
Query progress can be monitored at: http://node009:25000/query_plan?query_id=27459f84b4308766:6ed0235200000000
+---------+
| _c0     |
+---------+
| 6609331 |
+---------+
Fetched 1 row(s) in 4.50s
[node009:21000] > show create view t;
Query: show create view t
ERROR: AuthorizationException: User 'svc.analyticaldata_dq' does not have privileges to see the definition of view 'analytical_data.t'.
[node009:21000] > drop view t;
Query: drop view t

=== Metastore =============
[metastore]> select TBL_ID,TBL_NAME,OWNER,TBL_TYPE from TBLS where DB_ID=374406;
+---------+----------+-----------------------+--------------+
| TBL_ID  | TBL_NAME | OWNER                 | TBL_TYPE     |
+---------+----------+-----------------------+--------------+
| 1222804 | t        | svc.analyticaldata_dq | VIRTUAL_VIEW |
Labels: Apache Impala, Apache Sentry