Support Questions


Datanode goes down after a few seconds of starting


The DataNode automatically goes down a few seconds after being started from Ambari. I checked that the ambari-agent is working.

The datanode receives heartbeat responses but no commands from the namenode.

ambari-agent log file:

INFO 2017-01-24 03:44:59,747 PythonExecutor.py:118 - Result: {'structuredOut': {}, 'stdout': '', 'stderr': '', 'exitcode': 1}
INFO 2017-01-24 03:45:07,970 Heartbeat.py:78 - Building Heartbeat: {responseId = 210, timestamp = 1485247507970, commandsInProgress = False, componentsMapped = True}
INFO 2017-01-24 03:45:08,129 Controller.py:214 - Heartbeat response received (id = 211)
INFO 2017-01-24 03:45:08,129 Controller.py:249 - No commands sent from ip-172-31-17-251.ec2.internal
INFO 2017-01-24 03:45:18,130 Heartbeat.py:78 - Building Heartbeat: {responseId = 211, timestamp = 1485247518130, commandsInProgress = False, componentsMapped = True}
INFO 2017-01-24 03:45:18,274 Controller.py:214 - Heartbeat response received (id = 212)
INFO 2017-01-24 03:45:18,274 Controller.py:249 - No commands sent from NAMENODE.ec2.internal





1 ACCEPTED SOLUTION

Master Mentor

@Punit kumar

Regarding your latest error:

java.io.IOException: Incompatible clusterIDs in /mnt/disk1/hadoop/hdfs/data: namenode clusterID = CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee; datanode clusterID = CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1

It looks like your VERSION files have different cluster IDs on the NameNode and DataNode, and they need to match. Please check:

cat <dfs.namenode.name.dir>/current/VERSION
cat <dfs.datanode.data.dir>/current/VERSION 

Hence, copy the clusterID from the namenode and put it in the VERSION file of the datanode, then try again.
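A minimal sketch of that fix, assuming the directory paths below are placeholders (take the real values from your dfs.namenode.name.dir and dfs.datanode.data.dir properties; the datanode path here is the one from the exception above):

# Placeholder paths -- substitute your actual dfs.namenode.name.dir
# and dfs.datanode.data.dir values.
NN_DIR=/hadoop/hdfs/namenode
DN_DIR=/mnt/disk1/hadoop/hdfs/data

# Read the authoritative clusterID from the NameNode's VERSION file.
CID=$(grep '^clusterID=' "${NN_DIR}/current/VERSION" | cut -d= -f2)

# Back up the DataNode's VERSION file, then rewrite its clusterID line.
cp "${DN_DIR}/current/VERSION" "${DN_DIR}/current/VERSION.bak"
sed -i "s/^clusterID=.*/clusterID=${CID}/" "${DN_DIR}/current/VERSION"

Repeat for each data directory (e.g. /mnt/disk2/hadoop/hdfs/data), then try starting the DataNode again.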

Please refer to: http://www.dedunu.info/2015/05/how-to-fix-incompatible-clusterids-in.html



16 REPLIES

Master Mentor

@Punit kumar

1. Do you see any error / exception in the DataNode log?

2. After triggering the DataNode start operation from the Ambari UI, do you see any error/exception in ambari-server.log?

If yes, then can you please share those log snippets here?

3. Are you able to start/stop the other components present on that agent host? (Or is only the DataNode having this issue?)

4. Please also share the output of the "top" command so that we can see whether sufficient memory is available.

5. Once you trigger the command from the Ambari UI to start the DataNode, you should see files like the following getting created in "/var/lib/ambari-agent/data": command-3231.json, errors-3231.txt, output-3231.txt (the number will be different in your case, but the timestamps of these files should be the latest). Do you see any error in the errors file?
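For example, to spot the latest of those files on the agent host:

# The most recently written command/output/error files appear first.
ls -lt /var/lib/ambari-agent/data/ | head -5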



@Jay SenSharma

1. I didn't get any error in the datanode log.

2. ambari-server.log:

22:53:19,873  WARN [Thread-1] HeartbeatMonitor:150 - Heartbeat lost from host datanode.ec2.internal
22:53:19,874  WARN [Thread-1] HeartbeatMonitor:150 - Heartbeat lost from host datanode.ec2.internal
22:53:19,874  WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component GANGLIA_MONITOR on datanode.ec2.internal
22:53:19,874  WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component DATANODE on datanode.ec2.internal
22:53:19,874  WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component NODEMANAGER on datanode.ec2.internal
22:53:19,890  WARN [Thread-1] HeartbeatMonitor:150 - Heartbeat lost from host datanode.ec2.internal
22:53:19,890  WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component GANGLIA_MONITOR on datanode.ec2.internal


@Jay SenSharma

3. The other components on the agent are running without any issues; the only issue is with the datanode, which goes down after a few seconds.

4. On running the 'top' command, I can see there is enough free memory on the agent host.


@Jay SenSharma

Hi Jay,

Thanks for the reply.

I got an error in output-30684.txt:

2017-01-24 03:39:17,877 - File['/etc/hadoop/conf/slaves'] {'content': Template('slaves.j2'), 'owner': 'hdfs'}
2017-01-24 03:39:17,877 - Directory['/var/lib/hadoop-hdfs'] {'owner': 'hdfs', 'group': 'hadoop', 'mode': 0751, 'recursive': True}
2017-01-24 03:39:17,893 - Host contains mounts: ['/', '/proc', '/sys', '/dev/pts', '/dev/shm', '/mnt/disk1', '/mnt/disk2', '/proc/sys/fs/binfmt_misc'].
2017-01-24 03:39:17,894 - Mount point for directory /mnt/disk1/hadoop/hdfs/data is /mnt/disk1
2017-01-24 03:39:17,894 - Mount point for directory /mnt/disk2/hadoop/hdfs/data is /mnt/disk2
2017-01-24 03:39:17,895 - Directory['/var/run/hadoop/hdfs'] {'owner': 'hdfs', 'recursive': True}
2017-01-24 03:39:17,895 - Directory['/var/log/hadoop/hdfs'] {'owner': 'hdfs', 'recursive': True}
2017-01-24 03:39:17,896 - File['/var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid'] {'action': ['delete'], 'not_if': 'ls /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid` >/dev/null 2>&1'}
2017-01-24 03:39:17,919 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid']
2017-01-24 03:39:17,919 - Execute['ulimit -c unlimited;  su -s /bin/bash - hdfs -c 'export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode''] {'not_if': 'ls /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid` >/dev/null 2>&1'}



That was not an error; that was the output of that file.

I got nothing in errors-30684.txt.

Output of command-30684.txt:

          "namenode.ec2.internal"
        ],
        "hs_host": [
            "namenode.ec2.internal"
        ],
        "hive_server_host": [
            "namenode.ec2.internal"
        ]
    }
}


Master Mentor

@Punit kumar

Based on the output of the "output-30684.txt" file, we can see that the DataNode start instruction has already been given to the ambari-agent; the following is the command snippet:

 /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode

- So after that, the "hadoop-daemon.sh" script is actually responsible for starting the DataNode with the given arguments.

- Hence we should check the DataNode logs (.log and .out files) to find out what is going wrong.

- There might also be some OS resource constraints (low memory, low disk space, etc.). We can get information about those using OS tools like "top" and "df -h", but looking at the DataNode .log/.out files will give a much better idea here.
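For example, to look at the latest DataNode logs (assuming the default HDP log directory /var/log/hadoop/hdfs, which also appears in the agent output above; the exact file names include the hostname):

# Last 100 lines of the DataNode .log and .out files.
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-$(hostname).log
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-$(hostname).out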



Output of datanode.log:

2017-01-24 04:59:13,837 INFO  datanode.DataNode (DataNode.java:shutdown(1720)) - Shutdown complete.
2017-01-24 04:59:13,839 FATAL datanode.DataNode (DataNode.java:secureMain(2385)) - Exception in secureMain
java.io.IOException: the path component: '/var/lib/hadoop-hdfs' is owned by a user who is not root and not you.  Your effective user id is 0; the path is owned by user id 508, and its permissions are 0751.  Please fix this or select a different socket path.
        at org.apache.hadoop.net.unix.DomainSocket.validateSocketPathSecurity0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.bindAndListen(DomainSocket.java:189)
        at org.apache.hadoop.hdfs.net.DomainPeerServer.<init>(DomainPeerServer.java:40)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.getDomainPeerServer(DataNode.java:892)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initDataXceiver(DataNode.java:858)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1056)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:415)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2268)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2402)
2017-01-24 04:59:13,841 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2017-01-24 04:59:13,843 INFO  datanode.DataNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at datanode.ec2.internal/datanode
************************************************************/


@Jay SenSharma

So the error in the log file is about permissions on /var/lib/hadoop-hdfs.
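One possible fix, based on the ownership check in that exception (a sketch, assuming /var/lib/hadoop-hdfs is the domain socket path being validated, as the stack trace suggests): every component of the socket path must be owned by root or by the effective user, and the DataNode here runs with effective uid 0, so making the directory root-owned should satisfy the check.

# Make the socket directory root-owned; DomainSocket's path security
# validation accepts components owned by root or the effective user.
chown root:hadoop /var/lib/hadoop-hdfs
chmod 751 /var/lib/hadoop-hdfs

After that, retry the DataNode start from Ambari.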


df -h

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       30G  9.9G   19G  36% /
tmpfs            16G     0   16G   0% /dev/shm
/dev/xvdf       1.1T  905G   75G  93% /mnt/disk1
/dev/xvdg       1.1T  890G   90G  91% /mnt/disk2