New Contributor
Posts: 5
Registered: 09-20-2018

Impala/YARN : Cluster overloading on AWS

We host a CDH 5.16 cluster on AWS. Many data scientists use Impala, and most of the time they run poorly written queries or queries that end up with a bad plan. We generate custom cluster utilization reports that include CPU hours, memory TB, etc. for both Impala and YARN, but they don't give a clear picture of when a service hangs or which other jobs are running while cluster utilization is high. The aggregated metrics we collect don't let us figure this out: a particular job runs for 1 hour when there is no load and for 3-5 hours when utilization is high. We would like to identify such jobs and offload them to another cluster. Any tips or suggestions on how to collect such metrics? We are also open to third-party tools for this. Looking forward to suggestions from the forum. Thanks in advance.
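
To make the ask concrete, here is a minimal sketch of the per-job view that the aggregated reports lack: for each job, which other jobs ran during the same window, and how many jobs were running at once at the peak. The `Job` record and its fields are hypothetical. It assumes start and end times for Impala queries and YARN applications have already been exported from somewhere, and it is not tied to any particular Cloudera API.

```python
from __future__ import annotations

from datetime import datetime
from typing import NamedTuple


class Job(NamedTuple):
    """One finished query or application (hypothetical record layout)."""
    job_id: str
    service: str      # e.g. "impala" or "yarn"
    start: datetime
    end: datetime


def overlapping_jobs(jobs: list[Job]) -> dict[str, list[str]]:
    """Map each job id to the ids of jobs whose run window overlapped it."""
    overlaps: dict[str, list[str]] = {job.job_id: [] for job in jobs}
    for i, a in enumerate(jobs):
        for b in jobs[i + 1:]:
            # Two intervals overlap when each starts before the other ends.
            if a.start < b.end and b.start < a.end:
                overlaps[a.job_id].append(b.job_id)
                overlaps[b.job_id].append(a.job_id)
    return overlaps


def peak_concurrency(jobs: list[Job]) -> int:
    """Largest number of jobs running at the same instant (sweep line)."""
    events = []
    for job in jobs:
        events.append((job.start, 1))   # job starts: one more running
        events.append((job.end, -1))    # job ends: one fewer running
    # At equal timestamps, process ends before starts so that
    # back-to-back jobs are not counted as concurrent.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak
```

Comparing a slow run of a job against a fast run of the same job, and looking at what `overlapping_jobs` reports for each, would be one way to shortlist candidates for offloading. The pairwise loop is quadratic, so for large job histories it would need to be restricted to a time window first.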

Cloudera Employee
Posts: 437
Registered: 07-29-2015

Re: Impala/YARN : Cluster overloading on AWS

Not sure if this is an option for you, but WorkloadXM is designed to make visualising and analysing such problems much easier -