05-17-2019 09:59 AM
We host a CDH 5.16 cluster on AWS. Many data scientists use Impala, and they frequently run poorly written queries or queries that end up with a bad execution plan. We generate custom cluster utilization reports that include CPU hours, memory TB, etc. for both Impala and YARN. However, these reports don't give a clear picture of when a service hangs or which other jobs are running while cluster utilization is high. We can't work this out from the aggregated metrics we collect, because the same job runs for 1 hour when there is no load but takes 3-5 hours when utilization is high. We would like to identify such jobs and offload them to another cluster. Any tips or suggestions on how to collect such metrics? We are also open to third-party tools for collecting this kind of monitoring data. Looking forward to suggestions from the forum. Thanks in advance.
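To make the request more concrete, here is a minimal sketch of the per-run view we are missing. It assumes we can export one record per job run with a start and end time from wherever those are available. The `JobRun` fields and both function names are hypothetical, made up for illustration, and not part of any Cloudera API. For each run it reports how much slower it was than the fastest run of the same job, and which other runs overlapped with it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class JobRun:
    # All fields are hypothetical; map them to whatever your export provides.
    job_id: str      # unique id of this particular run
    name: str        # logical job name shared by repeated runs
    start: datetime
    end: datetime


def overlapping(run, all_runs):
    """Return ids of the other runs whose time window intersects `run`."""
    return [
        other.job_id
        for other in all_runs
        if other.job_id != run.job_id
        and other.start < run.end
        and run.start < other.end
    ]


def slowdown_report(all_runs):
    """Per run: duration, slowdown vs. the fastest run of the same job,
    and the ids of runs that were active at the same time."""
    # Fastest observed duration (seconds) for each logical job name.
    fastest = {}
    for run in all_runs:
        seconds = (run.end - run.start).total_seconds()
        fastest[run.name] = min(seconds, fastest.get(run.name, seconds))

    report = []
    for run in all_runs:
        seconds = (run.end - run.start).total_seconds()
        baseline = fastest[run.name]
        report.append({
            "job_id": run.job_id,
            "name": run.name,
            "duration_s": seconds,
            # Guard against zero-length baseline runs.
            "slowdown": seconds / baseline if baseline > 0 else None,
            "concurrent": overlapping(run, all_runs),
        })
    return report
```

Sorting the report by `slowdown` and looking at the `concurrent` column would show which jobs tend to be running whenever a normally fast job slows down. The pairwise overlap check is quadratic in the number of runs, which should be acceptable for a daily report but would need an interval index for larger histories.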