Created 06-27-2017 10:49 AM
I'm trying to add 2 new DataNodes to an existing HDP 2.3 cluster through Ambari. The existing 36 DataNodes each have 10 CPUs, 56 GB RAM and 8.5 TB of disk. The DataNode heap size is set to 1 GB. The 2 new nodes have 6 CPUs, 25 GB RAM and 1 TB of disk. HDFS disk usage is 7%. I'm able to start the NodeManager and Ambari Metrics services on the new nodes, but the DataNode service goes down immediately after starting.
Below are the logs from hadoop-hdfs-datanode-worker1.log
2017-06-27 12:07:30,047 INFO datanode.DataNode (BPServiceActor.java:blockReport(488)) - Successfully sent block report 0x2235b2b47bf3a, containing 1 storage report(s), of which we sent 1. The reports had 19549 total blocks and used 1 RPC(s). This took 10 msec to generate and 695 msecs for RPC and NN processing. Got back no commands.
2017-06-27 12:07:36,003 ERROR datanode.DataNode (DataXceiver.java:run(278)) - worker1.bigdata.net.net:50010:DataXceiver error processing unknown operation src: /10.255.yy.yy:49656 dst: /10.255.xx.xx:50010
java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:315)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
    at java.lang.Thread.run(Thread.java:745)
2017-06-27 12:08:00,180 INFO datanode.DataNode (DataXceiver.java:writeBlock(655)) - Receiving BP-1320493910-10.255.zz.zz-1479412973603:blk_1100238956_26515824 src: /10.254.yy.yy:45293 dest: /10.255.xx.xx:50010
2017-06-27 12:08:00,326 INFO DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1432)) - src: /10.254.yy.yy:45293, dest: /10.255.xx.xx:50010, bytes: 26872748, op: HDFS_WRITE, cliID: DFSClient_attempt_1498498030455_0521_r_000001_0_-908535141_1, offset: 0, srvID: f148bbe2-8f2a-489b-b03d-c8322aecd43e, blockid: BP-1320493910-10.255.zz.zz-1479412973603:blk_1100238956_26515824, duration: 122445075
2017-06-27 12:08:00,326 INFO datanode.DataNode (BlockReceiver.java:run(1405)) - PacketResponder: BP-1320493910-10.255.12.202-1479412973603:blk_1100238956_26515824, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
Thanks in advance.
Created 06-27-2017 12:08 PM
@Phoncy Joseph Old versions of Ambari used to display such health messages. They are usually harmless. This was fixed in Ambari 2.3.0 (https://issues.apache.org/jira/browse/AMBARI-12420). You can check whether you are able to access and create files in DFS. You can also run the hadoop fsck command to check the file system's health status.
Created 06-27-2017 12:51 PM
These are usually caused by the alerts framework doing a port check on the DataNode. Any unknown wire communication causes the DataNode to dump out an exception. They're harmless.
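To illustrate what a bare port check looks like, here is a minimal Python sketch. It is not Ambari's actual alert code; the function name and timeout are hypothetical. It connects and disconnects without sending any bytes, which would be consistent with the EOFException in Receiver.readOp in the log above: the DataNode tries to read an operation from the connection and finds nothing there.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration only; not taken from Ambari.
    """
    try:
        # Connect and close immediately without sending any bytes,
        # which is all a bare port check does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling port_is_open("worker1.bigdata.net.net", 50010) from another node would be one way to try reproducing the log entry, assuming 50010 is the DataNode data transfer port as shown in the log.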
What makes you think that the DataNode is actually going down? If it was, you'd see it shutting down in the logs.
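One way to check the log for a shutdown is a small Python sketch like the one below. The marker strings are assumptions: Hadoop daemons normally log a line starting with SHUTDOWN_MSG when they exit, and fatal errors are logged at FATAL level, but adjust the markers to whatever your log actually contains. The function name is hypothetical.

```python
def find_shutdown_lines(lines, markers=("SHUTDOWN_MSG", "FATAL")):
    """Return (line_number, text) pairs for lines containing any marker.

    The default markers are assumptions about what a DataNode shutdown
    looks like in the log; pass your own if they differ.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open hadoop-hdfs-datanode-worker1.log from your DataNode log directory and pass the file object to find_shutdown_lines. An empty result suggests the DataNode never logged a shutdown.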
Created 06-27-2017 01:13 PM
In the Ambari UI, the DataNode shows as stopped a few seconds after starting it. As suggested in the earlier reply, I ran the hdfs fsck command, and the newly added nodes are listed as well, though Ambari doesn't recognize the addition.
Created 06-27-2017 06:25 PM
If the DN is indeed going down, an alert should trigger as well. Can you post your DN log here in its entirety so we can see why it might be failing?
Created 06-29-2017 11:34 AM
The DN process is running when I check on the machine using ps -ef, but Ambari incorrectly shows the DataNode process as stopped.
Created 06-29-2017 06:33 PM
So Ambari says that the DN is stopped but the alert is OK and the process is running. That sounds like it's a problem with the process ID check during the status commands.
Does this file exist:
/var/run/hadoop/hadoop-hdfs-datanode.pid
That would contain the PID of the DN. The location may be customized in your environment, but chances are it's not. Try the following:
- Stop the DN in Ambari
- Remove this file by hand
- Check for the DN to be stopped using ps
- Start up the DN in Ambari
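Before removing the file, you can check whether it points at a live process. Below is a minimal Python sketch of that idea, assuming a Linux host. It is not Ambari's actual status check, and the function name is hypothetical.

```python
import os


def pid_file_status(pid_file):
    """Classify a PID file as 'missing', 'invalid', 'stale' or 'running'.

    Hypothetical helper for illustration; assumes a POSIX system.
    """
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        # File is empty or does not contain an integer.
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        # Signal 0 performs the existence check without sending a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return "stale"
    except PermissionError:
        # The process exists but belongs to another user.
        return "running"
    return "running"
```

Running pid_file_status("/var/run/hadoop/hadoop-hdfs-datanode.pid") as a user that can read the file shows what the file says. A result of "missing" or "stale" while ps -ef shows a DataNode would be consistent with the PID mismatch described above. A result of "running" only means some process has that PID, so compare it against the DataNode PID from ps -ef.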