Datanode going down after a few seconds of starting

I'm trying to add 2 new datanodes to an existing HDP 2.3 cluster through Ambari. The existing 36 datanodes each have 10 CPUs, 56 GB RAM, and 8.5 TB of disk; the datanode heap size is set to 1 GB. The 2 new nodes have 6 CPUs, 25 GB RAM, and 1 TB of disk. HDFS disk usage is 7%. I'm able to start the NodeManager and Ambari Metrics services on the new nodes, but the DataNode service goes down immediately after starting.

Below are the relevant lines from hadoop-hdfs-datanode-worker1.log:

2017-06-27 12:07:30,047 INFO  datanode.DataNode (BPServiceActor.java:blockReport(488)) - Successfully sent block report 0x2235b2b47bf3a,  containing 1 storage report(s), of which we sent 1. The reports had 19549 total blocks and used 1 RPC(s). This took 10 msec to generate and 695 msecs for RPC and NN processing. Got back no commands.
2017-06-27 12:07:36,003 ERROR datanode.DataNode (DataXceiver.java:run(278)) - worker1.bigdata.net.net:50010:DataXceiver error processing unknown operation  src: /10.255.yy.yy:49656 dst: /10.255.xx.xx:50010
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:315)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
        at java.lang.Thread.run(Thread.java:745)
2017-06-27 12:08:00,180 INFO  datanode.DataNode (DataXceiver.java:writeBlock(655)) - Receiving BP-1320493910-10.255.zz.zz-1479412973603:blk_1100238956_26515824 src: /10.254.yy.yy:45293 dest: /10.255.xx.xx:50010
2017-06-27 12:08:00,326 INFO  DataNode.clienttrace (BlockReceiver.java:finalizeBlock(1432)) - src: /10.254.yy.yy:45293, dest: /10.255.xx.xx:50010, bytes: 26872748, op: HDFS_WRITE, cliID: DFSClient_attempt_1498498030455_0521_r_000001_0_-908535141_1, offset: 0, srvID: f148bbe2-8f2a-489b-b03d-c8322aecd43e, blockid: BP-1320493910-10.255.zz.zz-1479412973603:blk_1100238956_26515824, duration: 122445075
2017-06-27 12:08:00,326 INFO  datanode.DataNode (BlockReceiver.java:run(1405)) - PacketResponder: BP-1320493910-10.255.12.202-1479412973603:blk_1100238956_26515824, type=HAS_DOWNSTREAM_IN_PIPELINE terminating


Thanks in advance.

6 REPLIES

Re: Datanode going down after a few seconds of starting

Contributor

@Phoncy Joseph Old versions of Ambari used to display such health messages; they are usually harmless. This was fixed in Ambari 2.3.0: https://issues.apache.org/jira/browse/AMBARI-12420. You can check whether you are able to access and create files in DFS, and you can also run the hadoop fsck command to check its health status.
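For reference, those DFS checks might look roughly like the sketch below. The paths and the `/tmp/dn-test.txt` file name are illustrative only, and fsck is typically run as the hdfs superuser; the script is guarded so it is a no-op on a machine without an HDFS client.

```shell
# Illustrative DFS health checks; file names and paths are examples only.
if command -v hdfs >/dev/null 2>&1; then
    # Can we create, read, and delete a file in DFS?
    echo "test" > /tmp/dn-test.txt
    hdfs dfs -put -f /tmp/dn-test.txt /tmp/dn-test.txt
    hdfs dfs -cat /tmp/dn-test.txt
    hdfs dfs -rm -skipTrash /tmp/dn-test.txt

    # Overall filesystem health, and whether the new nodes show up as live:
    hdfs fsck /
    hdfs dfsadmin -report | grep -A1 'Live datanodes'
else
    echo "hdfs client not found on this node"
fi
```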

Re: Datanode going down after a few seconds of starting

Super Collaborator

These are usually caused by the alerts framework doing a port check on the DataNode - any unknown wire communication causes them to dump out an exception - they're harmless.

What makes you think that the DataNode is actually going down? If it was, you'd see it shutting down in the logs.

Re: Datanode going down after a few seconds of starting

In the Ambari UI, the DataNode is in a stopped state a few seconds after starting it. As mentioned in the earlier reply, the hdfs fsck command does list the newly added nodes, though Ambari doesn't recognize the addition.

Re: Datanode going down after a few seconds of starting

Super Collaborator

If the DN is indeed going down, an alert should trigger as well. Can you post your DN log here in its entirety so we can see why it might be failing?

Re: Datanode going down after a few seconds of starting

The DN process is running if I do a check on the machine using ps -ef, but Ambari incorrectly shows the DataNode process as stopped.

Re: Datanode going down after a few seconds of starting

Super Collaborator

So Ambari says that the DN is stopped, but the alert is OK and the process is running. That sounds like a problem with the process-ID check during the status commands.

Does this file exist:

/var/run/hadoop/hadoop-hdfs-datanode.pid

That would contain the PID of the DN. This may be customized in your environment, but chances are it's not.

- Stop the DN in Ambari

- Remove this file by hand

- Check for the DN to be stopped using ps

- Start up the DN in Ambari
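The status check those steps work around can be sketched roughly as below. This is a hypothetical reconstruction, not Ambari's actual agent code: the agent reads the PID from the pid file and tests whether that process is alive, so a stale PID left in the file makes the check report "stopped" even while a DataNode with a different PID is running.

```shell
#!/bin/sh
# Hypothetical sketch of a pid-file status check. The real Ambari agent
# check differs in detail; the pid-file path in the thread is
# /var/run/hadoop/hadoop-hdfs-datanode.pid, but here we demo with a
# temporary file so the sketch runs anywhere.

dn_status() {
    pid_file=$1
    [ -f "$pid_file" ] || { echo "stopped (no pid file)"; return 1; }
    pid=$(cat "$pid_file")
    if kill -0 "$pid" 2>/dev/null; then     # signal 0: existence check only
        echo "running (pid $pid)"
    else
        echo "stopped (stale pid $pid)"
        return 1
    fi
}

# Demo with a temp file standing in for the real pid file:
tmp=$(mktemp)
echo $$ > "$tmp"             # our own PID: that process certainly exists
dn_status "$tmp"             # prints "running (pid ...)"
echo 99999999 > "$tmp"       # almost certainly no such process
dn_status "$tmp" || true     # prints "stopped (stale pid 99999999)"
rm -f "$tmp"
```

Removing the stale file before restarting, as in the steps above, lets Ambari write a fresh PID that matches the real process.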
