Created 08-03-2016 12:28 PM
We have a 5-node cluster (2 masters and 3 slaves) and we are running MR jobs, but we always see that only 2 nodes are getting loaded and utilized while the other node remains idle. What could be the reasons for this? All 3 slave nodes are in the same rack.
Created 08-04-2016 04:00 AM
Please do the following; it helped me.
1) Log in to the Ambari Web UI and navigate to:
HDFS -> Quick Links -> NameNode UI -> Datanodes
Check the HDFS capacity, how much of it is used and how much is left on each node, and whether blocks are distributed unevenly (see the command sketch at the end of this answer).
Jobs run on the DataNodes only, which is why they are running on two nodes only.
2) While running the MR app, YARN always tries to achieve data locality during the job run.
3) Run the balancer on the cluster so that data gets distributed across the DataNodes.
4) After the balancer completes, check how the jobs are running.
If jobs still behave the same, please update.
If this is helpful, accepting the answer is appreciated.
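For step 1, the same capacity and utilization numbers can also be pulled from the command line. A minimal sketch (assuming the commands are run as the hdfs superuser; adjust the user for your cluster):

# Per-DataNode capacity, DFS used and DFS remaining (same data as the NameNode UI Datanodes tab)
sudo -u hdfs hdfs dfsadmin -report

# Space used per top-level HDFS directory, to see where the data actually sits
sudo -u hdfs hdfs dfs -du -h /

If the report shows one or two DataNodes holding far more "DFS Used" than the others, that points to the uneven block distribution described above.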
Created 08-03-2016 01:00 PM
Hi @ARUN
The main reason might be that the data blocks needed by the MapReduce job are located on those two nodes themselves.
Can you please check the data blocks of the file you are processing and verify that the data is distributed across all 3 nodes?
Speculative execution (where, if a node is too busy running a task, a duplicate attempt of that task is launched on another node, such as the third one) may also not be kicking in.
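One way to verify where the blocks of the input file actually live is fsck. A minimal sketch, with /path/to/input as a placeholder for your job's input path:

# Lists every block of the file and the DataNodes that hold each replica
sudo -u hdfs hdfs fsck /path/to/input -files -blocks -locations

If all replicas of every block sit on the same two DataNodes, the job's map tasks will naturally be scheduled on those two nodes.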
Created 08-04-2016 04:22 AM
In addition to the above answers:
1. Can you please check whether the NodeManager state is healthy on the 3rd node? Sometimes, because of a disk failure or reserved disk space, a NodeManager goes into an unhealthy state even though the NodeManager daemon is still running, and jobs will not get scheduled on the problematic node (see the sketch below for a quick command-line check).
2. Most importantly, run the HDFS balancer if data distribution is uneven across the DataNodes.
3. Below is the command to run the HDFS balancer:
sudo -u hdfs hadoop balancer -threshold <threshold-value>
Note - the default threshold is 10; you can reduce it to as low as 1 depending on how closely you want to balance your cluster.
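For point 1, the NodeManager state can also be checked from the command line. A rough sketch (output columns may differ slightly between Hadoop versions):

# Lists every NodeManager with its state (RUNNING / UNHEALTHY) so a problematic node is easy to spot
yarn node -list -all

# Example balancer run with a tighter threshold of 5 (maximum allowed deviation, in percentage points, from average utilization)
sudo -u hdfs hadoop balancer -threshold 5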
Hope this information helps!
Created 08-04-2016 05:10 AM
Thanks @Kuldeep Kulkarni and @Shiva Nagesh
Created 09-27-2016 06:19 PM
Hi all,
Could you please share more information about the following scenario? I am trying to run exactly the command you mentioned, from the active master node:
16/09/26 17:42:53 INFO balancer.Balancer: namenodes = [hdfs://hadoop2, hdfs://linux.lab.domain.com:8020]
16/09/26 17:42:53 INFO balancer.Balancer: parameters = Balancer.Parameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, run during upgrade = false]
16/09/26 17:42:53 INFO balancer.Balancer: included nodes = []
16/09/26 17:42:53 INFO balancer.Balancer: excluded nodes = []
16/09/26 17:42:53 INFO balancer.Balancer: source nodes = []
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
16/09/26 17:42:53 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/09/26 17:42:53 INFO block.BlockTokenSecretManager: Setting block keys
16/09/26 17:42:53 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
16/09/26 17:42:54 INFO block.BlockTokenSecretManager: Setting block keys
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getServerDefaults(FSNamesystem.java:1618)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getServerDefaults(NameNodeRpcServer.java:595)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getServerDefaults(ClientNamenodeProtocolServerSideTranslatorPB.java:383)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
. Exiting ...
Sep 26, 2016 5:42:54 PM  Balancing took 1.314 seconds
It completes in less than 2 seconds, so it seems it is not really running properly, right?
How long could it take on average if it really runs fine? Or is there any log to check for further information if some error is happening?
I also tried running the same from the Ambari console (Balance All Nodes); the result is the same.
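For reference, a minimal sketch of how the NameNode HA state can be confirmed before running the balancer (nn1 and nn2 are placeholder service IDs; the actual IDs come from dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

# List the NameNodes that the client configuration knows about
hdfs getconf -namenodes

# Ask each configured NameNode whether it is currently active or standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2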
thanks and br