
How to configure SPARK/YARN to consume memory efficiently?



I'm running a terasort test I've taken from here.

While the sort is running, I noticed that many GB of disk are used at /hadoop/yarn/local (i.e., yarn.nodemanager.local-dirs). This means a lot of reading and writing to physical disk beyond the files being sorted.

In addition, I see that most of my 256 GB of RAM is used by buff/cache.
Is this the most efficient use of the memory I have?

To use more RAM for sorting and speed up processing, I tried tuning various values for:

- yarn.scheduler.minimum-allocation-mb (between 4 GB and 8 GB)

- SPARK_EXECUTOR_MEMORY (between 4 GB and 30 GB)

- SPARK_EXECUTOR_INSTANCES (between 4 and 16, where each instance is allocated 4-5 cores without overcommitting)

Once I use more than 8 instances, I don't see a significant change in the time it takes to complete the sort with any configuration.
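For context, here is the back-of-envelope memory math I've been doing (the executor memory and instance count below are illustrative picks from the ranges above, not my exact config). It assumes Spark's default executor memory overhead of max(384 MB, 10% of executor memory), which suggests the YARN containers may be reserving only a fraction of the 256 GB:

```python
# Rough YARN/Spark container memory math -- illustrative values only.
# Spark's default executor memory overhead is max(384 MB, 10% of executor memory).
executor_memory_gb = 16            # e.g. SPARK_EXECUTOR_MEMORY=16G
num_executors = 8                  # e.g. SPARK_EXECUTOR_INSTANCES=8
overhead_gb = max(0.375, 0.10 * executor_memory_gb)  # 384 MB = 0.375 GB

# Total RAM actually reserved by YARN for the Spark executors.
total_reserved_gb = num_executors * (executor_memory_gb + overhead_gb)
print(total_reserved_gb)           # 140.8 -- just over half of the 256 GB
```

So even at the upper end of the settings I tried, the rest of the RAM is left to the OS page cache, which matches the large buff/cache I'm seeing.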

Any idea what else I should look at, and what to tune, to make the best use of the memory on this server and achieve better performance?