
How to balance the workload on cluster datanodes

I have a cluster with 4 datanodes and 1 namenode. The cluster has 2 ZooKeeper servers, one on the namenode and one on a datanode. When I run a benchmark on the cluster, I notice that the datanode hosting ZooKeeper is busier than the other 3 datanodes. The extra workload comes from org.apache.spark.deploy.yarn.ExecutorLauncher:

/usr/jdk64/jdk1.8.0_112/bin/java -server -Xmx512m -Dhdp.version= org.apache.spark.deploy.yarn.ExecutorLauncher --arg msl-dpe-d13.msl.lab:38313 --properties-file /hadoop/yarn/local/usercache/spark/appcache/application_1551897449716_0001/container_e11_1551897449716_0001_01_000001/__spark_conf__/

My questions are:

1. Should org.apache.spark.deploy.yarn.ExecutorLauncher run on the namenode?

2. How can I move org.apache.spark.deploy.yarn.ExecutorLauncher to run on the namenode?


Re: How to balance the workload on cluster datanodes


@Harry Li

In principle, you should have 3 ZooKeeper servers, usually called an "ensemble", to avoid split-brain situations. Explanation: when running ZooKeeper in production, one needs a quorum, which means that more than half of the servers are up and running. If your client connects to a ZooKeeper server that does not participate in a quorum, it will not be able to answer any queries. This is the only way ZooKeeper can protect itself against split brain in case of a network partition.
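The quorum rule above can be sketched with a bit of shell arithmetic (the ensemble sizes are illustrative, not taken from your cluster): a quorum is floor(N/2)+1 servers, so a 2-server setup like yours tolerates zero failures, while a 3-server ensemble tolerates one.

```shell
# Quorum size for a ZooKeeper ensemble of N servers is floor(N/2)+1;
# the ensemble can serve requests only while a quorum of servers is up.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "ensemble=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```

Note that 2 servers tolerate no more failures than 1: if either goes down, the remaining single server is not a majority, so your current 2-server layout gains availability only by moving to 3.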

You should install the YARN client and NodeManager on all the datanodes. In YARN cluster mode everything runs inside the cluster: when you start a job from a client, the job will continue running even if you disconnect from your client, because the Spark driver is encapsulated inside the YARN ApplicationMaster, which runs on a NodeManager/datanode.
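As a sketch (the class name, JAR path, and resource sizes are placeholders, not taken from your setup), a cluster-mode submission looks like this:

```shell
# Submit in YARN cluster mode: the driver runs inside the ApplicationMaster
# on one of the NodeManager/datanodes, not on the machine you submit from,
# so disconnecting the client does not kill the job.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  /path/to/myapp.jar
```

With `--deploy-mode client` instead, the driver would stay on the submitting machine and the job would die with the client session.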

Answers to your questions

1. No.

The namenode has a different purpose (metadata reference): it keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
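You can see this split between metadata and data for yourself: fsck asks the namenode for a file's block information, and every location it reports is a datanode (the path below is a placeholder):

```shell
# Ask the NameNode for the block metadata of a file; the block locations
# it prints are all DataNodes, because only DataNodes hold file data.
hdfs fsck /tmp/sample.txt -files -blocks -locations
```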

When you launch Spark in cluster mode, the job executes on the worker/data nodes, where the data processing happens.

2. No.

See above: the ApplicationMaster belongs on a NodeManager/datanode, not on the namenode.