How do you know if you need to scale out your cluster for computation reasons and not for HDFS space?


Typically you either need to scale out due to HDFS disk usage, or you need to scale out for computational reasons.

If I have 10 or so datanodes, each with 80% of system memory allocated to YARN, would all of them running at 100% of their YARN allocation for most of the day indicate that I need to scale out for computational reasons? Currently my HDFS is only at 60% utilization.

I am primarily running Tez jobs. CPU doesn't seem to be hit as much, but my YARN memory allocation is constantly at 100% and I have users complaining about slow-running jobs. I assume this is because their jobs have to wait for other jobs to free up resources before they can run.

Are there any things I could look for in this situation?

Running Ambari 2.5.1 and HDP 2.6.1.



You are correct in determining that your compute is constrained and HDFS is not.
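If you want to confirm that with numbers rather than by watching the Ambari dashboard, the ResourceManager REST API exposes cluster-wide metrics at `/ws/v1/cluster/metrics`. Below is a minimal Python sketch that reads allocated vs. total YARN memory and the pending counts. The host name in the usage note, the 95% threshold, and the helper names are my own assumptions, not anything from your cluster.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch the clusterMetrics object from the ResourceManager REST API."""
    url = rm_url.rstrip("/") + "/ws/v1/cluster/metrics"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["clusterMetrics"]


def summarize(metrics):
    """Reduce clusterMetrics to the numbers relevant to memory pressure."""
    total = metrics["totalMB"]
    allocated = metrics["allocatedMB"]
    used_pct = round(100.0 * allocated / total, 1) if total else 0.0
    return {
        "memory_used_pct": used_pct,
        "apps_pending": metrics["appsPending"],
        "containers_pending": metrics["containersPending"],
    }


def looks_memory_bound(summary, pct_threshold=95.0):
    """Heuristic (assumed threshold): memory is nearly full AND work is queued."""
    has_backlog = summary["apps_pending"] > 0 or summary["containers_pending"] > 0
    return summary["memory_used_pct"] >= pct_threshold and has_backlog
```

For example, `looks_memory_bound(summarize(fetch_cluster_metrics("http://rm-host:8088")))`, where `rm-host` is a placeholder for your ResourceManager. Sampling this every few minutes over a day shows whether pending applications pile up whenever memory is saturated, which is the pattern your users' complaints suggest.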

Before scaling out, you can try to do the following:

  • Optimize your jobs/queries. If you are running Hive queries, there is probably a lot of room to optimize them. Tez configurations may need tuning as well. (See links below)
  • Reconfigure YARN queues to prioritize user jobs over other jobs (e.g. batch ETL) by allowing the user queues to preempt the other queues.

If these do not prevent YARN memory saturation (first bullet) or speed user jobs/queries (second bullet) then you will need to scale out by adding more data nodes.
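To make the queue suggestion more concrete, here is a rough Python sketch that generates the relevant properties for a two-queue Capacity Scheduler layout with preemption turned on. The queue names `users` and `etl` and the 60/40 split are hypothetical; the property names are standard YARN ones, but please check them against the documentation for your HDP version before applying anything through Ambari.

```python
PREEMPTION_POLICY = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "monitor.capacity.ProportionalCapacityPreemptionPolicy"
)


def yarn_site_preemption():
    """yarn-site properties that enable the Capacity Scheduler preemption monitor."""
    return {
        "yarn.resourcemanager.scheduler.monitor.enable": "true",
        "yarn.resourcemanager.scheduler.monitor.policies": PREEMPTION_POLICY,
    }


def capacity_scheduler_queues(capacities, protected=()):
    """Build capacity-scheduler properties for queues directly under root.

    capacities: dict of queue name -> guaranteed capacity in percent.
    protected:  queue names whose containers should not be preempted.
    """
    if sum(capacities.values()) != 100:
        raise ValueError("queue capacities under root must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(capacities)}
    for name, pct in capacities.items():
        prefix = "yarn.scheduler.capacity.root." + name
        props[prefix + ".capacity"] = str(pct)
        # Let each queue borrow idle capacity up to the whole cluster.
        props[prefix + ".maximum-capacity"] = "100"
        if name in protected:
            props[prefix + ".disable_preemption"] = "true"
    return props


if __name__ == "__main__":
    # Hypothetical layout: users are guaranteed 60%, batch ETL 40%.
    settings = dict(yarn_site_preemption())
    settings.update(capacity_scheduler_queues({"users": 60, "etl": 40},
                                              protected=("users",)))
    for key in sorted(settings):
        print(key + "=" + settings[key])
```

With a layout like this, ETL can still use the whole cluster when nobody else is on it, but once users submit work the scheduler can reclaim containers from `etl` until `users` is back at its guaranteed share. Whether to protect the user queue from preemption is a judgment call, since it can slow ETL further during busy periods.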

You should also be doing capacity planning. If you project that cluster usage will increase steadily (more jobs, more concurrent users), then optimizing as above likely only buys you some time before the increased usage brings you back to the same state. Note also that if you project an increase in data stored on the cluster, your HDFS utilization will climb steadily from the current 60%. It is good practice not to let it exceed 80%, since disk space is also needed for writing intermediate results during jobs. If you are on bare metal, you will need some lead time to procure and rack-and-stack your data nodes, so you will need to plan to scale out well before HDFS capacity hits 80% or cluster usage increases significantly.
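As a back-of-the-envelope illustration of that lead-time point, here is a small Python sketch that estimates how many days remain before HDFS reaches the 80% mark and how long you can wait before ordering hardware. It assumes linear growth, which is a simplification, and the capacity, growth rate, and lead time in the example are made-up numbers.

```python
import math


def days_until_threshold(used_tb, capacity_tb, growth_tb_per_day, threshold=0.80):
    """Days until HDFS usage reaches threshold * capacity, assuming linear growth.

    Returns 0 if usage is already at or past the threshold,
    and None if usage is not growing.
    """
    limit_tb = capacity_tb * threshold
    if used_tb >= limit_tb:
        return 0
    if growth_tb_per_day <= 0:
        return None
    return math.ceil((limit_tb - used_tb) / growth_tb_per_day)


def days_until_order(used_tb, capacity_tb, growth_tb_per_day, lead_time_days,
                     threshold=0.80):
    """Days left before new nodes must be ordered, given a procurement lead time."""
    remaining = days_until_threshold(used_tb, capacity_tb, growth_tb_per_day, threshold)
    if remaining is None:
        return None
    return max(0, remaining - lead_time_days)


if __name__ == "__main__":
    # Hypothetical cluster: 100 TB usable, 60 TB used, growing 0.5 TB per day,
    # with a 30-day lead time for procurement and racking.
    print(days_until_threshold(60, 100, 0.5))
    print(days_until_order(60, 100, 0.5, lead_time_days=30))
```

The same arithmetic works for YARN memory if you track peak allocation over time instead of HDFS usage.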
