Created 08-03-2016 12:28 PM
We have a 5-node cluster (2 masters and 3 slaves) and we are running MR jobs, but we always see that only 2 nodes are getting loaded and utilized while the other node remains idle. What could be the reasons for this? All 3 slave nodes are in the same rack.
Created 08-04-2016 04:00 AM
Please do the following; it helped me.
1) Log in to the Ambari Web UI and navigate to:
HDFS -> Quick Links -> NameNode UI -> Datanodes
Check the HDFS capacity, how much of it is used and how much is left on each node, and whether blocks are distributed unevenly (see the command sketch at the end of this answer).
Jobs run on the DataNodes only, which is why they are running on two nodes only.
2) While running the MR app, YARN always tries to achieve data locality during the job run.
3) Run the balancer on the cluster so that data gets distributed across the DataNodes.
4) After the balancer completes, check how the jobs are running.
If jobs still behave the same, please update.
If this is helpful, accepting the answer is appreciated.
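For step 1, the same capacity and utilization numbers can also be pulled from the command line. A minimal sketch (assuming the commands are run as the hdfs superuser; adjust the user for your cluster):

# Per-DataNode capacity, DFS used and DFS remaining (same data as the NameNode UI Datanodes tab)
sudo -u hdfs hdfs dfsadmin -report

# Space used per top-level HDFS directory, to see where the data actually sits
sudo -u hdfs hdfs dfs -du -h /

If the report shows one or two DataNodes holding far more "DFS Used" than the others, that points to the uneven block distribution described above.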
Created 08-03-2016 01:00 PM
Hi @ARUN
The main reason might be that the data blocks needed by the MapReduce job are located on those two nodes themselves.
Can you please check the data blocks of the file you are processing and verify that the data is distributed across all 3 nodes?
Speculative execution (where, if a node is too busy running a task, a duplicate attempt of that task is launched on another node, such as the third one) may also not be kicking in.
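One way to verify where the blocks of the input file actually live is fsck. A minimal sketch, with /path/to/input as a placeholder for your job's input path:

# Lists every block of the file and the DataNodes that hold each replica
sudo -u hdfs hdfs fsck /path/to/input -files -blocks -locations

If all replicas of every block sit on the same two DataNodes, the job's map tasks will naturally be scheduled on those two nodes.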
Created 08-04-2016 04:22 AM
In addition to the above answers:
1. Can you please check whether the NodeManager state is healthy on the 3rd node? Sometimes, because of a disk failure or reserved disk space, a NodeManager goes into an unhealthy state even though the NodeManager daemon is still running, and jobs will not get scheduled on the problematic node (see the sketch below for a quick command-line check).
2. Most importantly, run the HDFS balancer if data distribution is uneven across the DataNodes.
3. Below is the command to run the HDFS balancer:
sudo -u hdfs hadoop balancer -threshold <threshold-value>
Note - the default threshold is 10; you can reduce it to as low as 1 depending on how closely you want to balance your cluster.
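For point 1, the NodeManager state can also be checked from the command line. A rough sketch (output columns may differ slightly between Hadoop versions):

# Lists every NodeManager with its state (RUNNING / UNHEALTHY) so a problematic node is easy to spot
yarn node -list -all

# Example balancer run with a tighter threshold of 5 (maximum allowed deviation, in percentage points, from average utilization)
sudo -u hdfs hadoop balancer -threshold 5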
Hope this information helps!
Created 08-04-2016 05:10 AM
Thanks @Kuldeep Kulkarni and @Shiva Nagesh
Created 09-27-2016 06:19 PM
Hi all,
Could you please share more information about the following scenario? I am trying to run exactly the command you mentioned, from the active master node:
16/09/26 17:42:53 INFO balancer.Balancer: namenodes = [hdfs://hadoop2, hdfs://linux.lab.domain.com:8020]
16/09/26 17:42:53 INFO balancer.Balancer: parameters = Balancer.Parameters [BalancingPolicy.Node, threshold = 5.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, run during upgrade = false]
16/09/26 17:42:53 INFO balancer.Balancer: included nodes = []
16/09/26 17:42:53 INFO balancer.Balancer: excluded nodes = []
16/09/26 17:42:53 INFO balancer.Balancer: source nodes = []
Time Stamp  Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
16/09/26 17:42:53 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/09/26 17:42:53 INFO block.BlockTokenSecretManager: Setting block keys
16/09/26 17:42:53 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
16/09/26 17:42:54 INFO block.BlockTokenSecretManager: Setting block keys
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getServerDefaults(FSNamesystem.java:1618)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getServerDefaults(NameNodeRpcServer.java:595)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getServerDefaults(ClientNamenodeProtocolServerSideTranslatorPB.java:383)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
. Exiting ...
Sep 26, 2016 5:42:54 PM  Balancing took 1.314 seconds
It completes in less than 2 seconds, so it seems it is not really running properly, right?
How long could it take on average if it really runs fine? Or is there any log to check for further information if some error is happening?
I also tried running the same from the Ambari console (Balance All Nodes); the result is the same.
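For reference, a minimal sketch of how the NameNode HA state can be confirmed before running the balancer (nn1 and nn2 are placeholder service IDs; the actual IDs come from dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

# List the NameNodes that the client configuration knows about
hdfs getconf -namenodes

# Ask each configured NameNode whether it is currently active or standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2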
thanks and br