We are trying to identify cluster utilization metrics (specific to Cloudera) that would make good KPIs for predicting future workloads, managing users, and so on. We have a large number of data scientists on the cluster and heavy utilization of Impala, Spark, and YARN. Has the DataScience CoE done any analysis of this kind with Cloudera?
Some examples could be CPU utilization, YARN/Impala utilization, and long-running Impala jobs, each paired with a prediction model.
We explored Cloudera Workload XM briefly, but it doesn’t give any predictive analysis of cluster utilization.
Has anyone tried the Time Series Table Metrics option and drawn conclusions from it? Any tips or suggestions?
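
To make the ask concrete, here is a minimal sketch of the kind of prediction we have in mind. It assumes hourly CPU utilization percentages have already been exported from the cluster into a plain Python list; how that export happens is not shown. The function names and the simple linear-trend approach are illustrative assumptions on our part, not something Cloudera provides, and a real model would likely need to account for daily and weekly seasonality.

```python
from statistics import mean


def fit_linear_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_mean = mean(xs)
    y_mean = mean(values)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def forecast_utilization(values, steps_ahead):
    """Extrapolate the fitted trend, clamped to the 0-100 percent range."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [
        min(100.0, max(0.0, intercept + slope * (n + k)))
        for k in range(steps_ahead)
    ]


if __name__ == "__main__":
    # Hypothetical hourly CPU utilization samples (percent), for illustration only.
    cpu_percent = [42.0, 45.5, 47.0, 51.5, 53.0, 56.5]
    print(forecast_utilization(cpu_percent, steps_ahead=3))
```

The same shape would apply to YARN memory usage or counts of long-running Impala jobs: export one series per metric, fit, and compare the forecast against a capacity threshold.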