
Datanode goes down after a few seconds of starting


The DataNode automatically goes down a few seconds after starting it from Ambari. I checked that the Ambari agent is working.

The DataNode receives the heartbeat but no commands from the NameNode.

Ambari agent log file:

INFO 2017-01-24 03:44:59,747 PythonExecutor.py:118 - Result: {'structuredOut': {}, 'stdout': '', 'stderr': '', 'exitcode': 1}
INFO 2017-01-24 03:45:07,970 Heartbeat.py:78 - Building Heartbeat: {responseId = 210, timestamp = 1485247507970, commandsInProgress = False, componentsMapped = True}
INFO 2017-01-24 03:45:08,129 Controller.py:214 - Heartbeat response received (id = 211)
INFO 2017-01-24 03:45:08,129 Controller.py:249 - No commands sent from ip-172-31-17-251.ec2.internal
INFO 2017-01-24 03:45:18,130 Heartbeat.py:78 - Building Heartbeat: {responseId = 211, timestamp = 1485247518130, commandsInProgress = False, componentsMapped = True}
INFO 2017-01-24 03:45:18,274 Controller.py:214 - Heartbeat response received (id = 212)
INFO 2017-01-24 03:45:18,274 Controller.py:249 - No commands sent from NAMENODE.ec2.internal






16 REPLIES


top

top - 05:03:36 up  4:41,  1 user,  load average: 0.00, 0.00, 0.00
Tasks: 186 total,   1 running, 185 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.4%sy,  0.0%ni, 99.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32877652k total,  1678960k used, 31198692k free,   335884k buffers
Swap:        0k total,        0k used,        0k free,   517928k cached



Master Mentor

@Punit kumar

The problem seems to be related to directory permissions:

java.io.IOException: the path component: '/var/lib/hadoop-hdfs' is owned by a user who is not root and not you.  Your effective user id is 0; the path is owned by user id 508, and its permissions are 0751.  Please fix this or select a different socket path.


- The DataNode log is complaining about the permissions on "/var/lib/hadoop-hdfs", so please check what permissions you have there. By default it should be owned by "hdfs:hadoop", as follows:

# ls -lart /var/lib/hadoop-hdfs
drwxrwxrwt.  2 hdfs hadoop 4096 Aug 10 11:23 cache
srw-rw-rw-.  1 hdfs hadoop    0 Jan 24 09:09 dn_socket

- It would be best to compare the permissions on this directory, "/var/lib/hadoop-hdfs", with those on your working DataNode hosts; a sketch of the fix follows below.

- For more information about this exception, see the use of the "validateSocketPathSecurity0" method:

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/ap...
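If the ownership or mode there turns out to be wrong, here is a minimal sketch of the fix, assuming the default "hdfs:hadoop" ownership and the 0751 mode mentioned in the error; verify both against a healthy host first:

# run as root on the affected DataNode host
chown hdfs:hadoop /var/lib/hadoop-hdfs
chmod 751 /var/lib/hadoop-hdfs
ls -lart /var/lib/hadoop-hdfs    # confirm the listing now matches the healthy host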



Yeah, that was right:

total 4
drwxrwxrwt. 2 hdfs hadoop 4096 Nov 19  2014 cache
srw-rw-rw-. 1 hdfs hadoop    0 Jan 24 03:39 dn_socket


Actually, none of my DataNode hosts is working.
Is that a memory issue?


@Jay SenSharma

Now I am getting this error in datanode.log:

2017-01-24 03:39:19,891 INFO  common.Storage (Storage.java:tryLock(715)) - Lock on /mnt/disk1/hadoop/hdfs/data/in_use.lock acquired by nodename 1491@datanode.ec2.internal
2017-01-24 03:39:19,902 INFO  common.Storage (Storage.java:tryLock(715)) - Lock on /mnt/disk2/hadoop/hdfs/data/in_use.lock acquired by nodename 1491@datanode.ec2.internal
2017-01-24 03:39:19,903 FATAL datanode.DataNode (BPServiceActor.java:run(840)) - Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to namenode.ec2.internal/namenode:8020. Exiting.
java.io.IOException: Incompatible clusterIDs in /mnt/disk1/hadoop/hdfs/data: namenode clusterID = CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee; datanode clusterID = CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:646)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:320)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:403)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:422)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1311)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1276)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:828)
        at java.lang.Thread.run(Thread.java:745)
2017-01-24 03:39:19,904 WARN  datanode.DataNode (BPServiceActor.java:run(861)) - Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to ip-172-31-17-251.ec2.internal/172.31.17.251:8020
2017-01-24 03:39:20,005 INFO  datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool <registering> (Datanode Uuid unassigned)
2017-01-24 03:39:22,005 WARN  datanode.DataNode (DataNode.java:secureMain(2392)) - Exiting Datanode
2017-01-24 03:39:22,007 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0
2017-01-24 03:39:22,008 INFO  datanode.DataNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at datanode.ec2.internal/datanode
************************************************************/

Master Mentor

@Punit kumar

I suggest you kill these DataNodes (if any DN daemon processes are still running) and then try starting them manually as the "hdfs" user, to see whether they come up fine. In parallel, keep the DataNode log open in "tail" so that we can see whether it still shows the same error.
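For example, a minimal sketch of that procedure; the HDP-style daemon script path and log file name below are assumptions, so adjust them to your layout:

# find and kill any stray DataNode processes
ps -ef | grep -i [d]atanode
kill <pid-from-the-listing>

# start the DataNode manually as the hdfs user (HDP-style path; adjust as needed)
su - hdfs -c "/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh start datanode"

# in a second terminal, follow the DataNode log
tail -f /var/log/hadoop/hdfs/hadoop-hdfs-datanode-$(hostname -f).log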

Once they come up successfully, try starting them from Ambari the next time.

Master Mentor

@Punit kumar

Regarding your latest error:

java.io.IOException: Incompatible clusterIDs in /mnt/disk1/hadoop/hdfs/data: namenode clusterID = CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee; datanode clusterID = CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1

It looks like the VERSION files on the NameNode and the DataNode contain different cluster IDs, and they need to match. Please check both:

cat <dfs.namenode.name.dir>/current/VERSION
cat <dfs.datanode.data.dir>/current/VERSION 
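For reference, a DataNode VERSION file looks roughly like the following; every value here is an illustrative placeholder except the clusterID, which is the mismatched one from your log:

#Tue Jan 24 03:39:19 UTC 2017
storageID=DS-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
clusterID=CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1
cTime=0
datanodeUuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
storageType=DATA_NODE
layoutVersion=-56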

Copy the clusterID from the NameNode, put it into the VERSION file of the DataNode, and then try again.
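A minimal sketch of that edit, using the data directories and the NameNode clusterID from your log (run it on the DataNode host; sed keeps a .bak backup of each file):

# update the clusterID in the VERSION file of every dfs.datanode.data.dir
for d in /mnt/disk1/hadoop/hdfs/data /mnt/disk2/hadoop/hdfs/data; do
  sed -i.bak 's/^clusterID=.*/clusterID=CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee/' "$d/current/VERSION"
done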

Please refer to: http://www.dedunu.info/2015/05/how-to-fix-incompatible-clusterids-in.html



@Jay SenSharma

Thanks, now it's working.