Created 01-24-2017 08:59 AM
The DataNode automatically goes down a few seconds after being started from Ambari. I checked that the Ambari agent is working.
The DataNode receives the heartbeat but no commands from the NameNode.
Ambari agent log file:
INFO 2017-01-24 03:44:59,747 PythonExecutor.py:118 - Result: {'structuredOut': {}, 'stdout': '', 'stderr': '', 'exitcode': 1}
INFO 2017-01-24 03:45:07,970 Heartbeat.py:78 - Building Heartbeat: {responseId = 210, timestamp = 1485247507970, commandsInProgress = False, componentsMapped = True}
INFO 2017-01-24 03:45:08,129 Controller.py:214 - Heartbeat response received (id = 211)
INFO 2017-01-24 03:45:08,129 Controller.py:249 - No commands sent from ip-172-31-17-251.ec2.internal
INFO 2017-01-24 03:45:18,130 Heartbeat.py:78 - Building Heartbeat: {responseId = 211, timestamp = 1485247518130, commandsInProgress = False, componentsMapped = True}
INFO 2017-01-24 03:45:18,274 Controller.py:214 - Heartbeat response received (id = 212)
INFO 2017-01-24 03:45:18,274 Controller.py:249 - No commands sent from NAMENODE.ec2.internal
Created 01-24-2017 12:26 PM
Regarding your latest error:
java.io.IOException: Incompatible clusterIDs in /mnt/disk1/hadoop/hdfs/data: namenode clusterID = CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee; datanode clusterID = CID-7591e6bd-ce9b-4b14-910c-c9603892a0f1 at
It looks like the VERSION files on the NameNode and the DataNode contain different cluster IDs, which need to be corrected. Please check:
cat <dfs.namenode.name.dir>/current/VERSION
cat <dfs.datanode.data.dir>/current/VERSION
Then copy the clusterID from the NameNode, put it in the VERSION file of the DataNode, and try again (see the sketch below).
Please refer to: http://www.dedunu.info/2015/05/how-to-fix-incompatible-clusterids-in.html
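A minimal sketch of the check and fix, assuming the data directories shown elsewhere in this thread (/mnt/disk1/hadoop/hdfs/data and /mnt/disk2/hadoop/hdfs/data), the NameNode clusterID from your error message, and that the DataNode is stopped while the file is edited:

# On the NameNode host (substitute your actual dfs.namenode.name.dir):
grep clusterID <dfs.namenode.name.dir>/current/VERSION

# On the DataNode host, compare the IDs:
grep clusterID /mnt/disk1/hadoop/hdfs/data/current/VERSION
grep clusterID /mnt/disk2/hadoop/hdfs/data/current/VERSION

# If they differ, set the DataNode's clusterID to the NameNode's value, e.g.:
sed -i 's/^clusterID=.*/clusterID=CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee/' /mnt/disk1/hadoop/hdfs/data/current/VERSION
sed -i 's/^clusterID=.*/clusterID=CID-297a140f-7cd6-4c73-afc8-bd0a7d01c0ee/' /mnt/disk2/hadoop/hdfs/data/current/VERSION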
Created 01-24-2017 09:03 AM
1. Do you see any error / exception in the DataNode log?
2. After triggering DataNode start operation from Ambari UI do you see any Error/Exception in ambari-server.log?
If yes, then can you please share those log snippets here?
3. Are you able to start/stop the other components present on that agent host? (Or is only the DataNode having this issue?)
4. Please share the output of the "top" command so that we can see whether sufficient memory is available.
5. Once you trigger the command from the Ambari UI to start the DataNode, you should see the following kind of files getting created in "/var/lib/ambari-agent/data" (the number will be different in your case, but the timestamp should be the latest for these files): command-3231.json, errors-3231.txt, output-3231.txt. Do you see any error in the errors file? A quick way to locate them is sketched after this list.
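A minimal sketch for locating and inspecting the newest of these files, assuming the default agent data directory above (the suffix 3231 is just an example and will differ on your host):

ls -lt /var/lib/ambari-agent/data | head -5      # newest command/errors/output files first
tail -n 50 /var/lib/ambari-agent/data/errors-3231.txt
tail -n 50 /var/lib/ambari-agent/data/output-3231.txt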
Created 01-24-2017 09:52 AM
1. I didn't get any error in the DataNode log.
2. ambari-server.log
22:53:19,873 WARN [Thread-1] HeartbeatMonitor:150 - Heartbeat lost from host datanode.ec2.internal
22:53:19,874 WARN [Thread-1] HeartbeatMonitor:150 - Heartbeat lost from host datanode.ec2.internal
22:53:19,874 WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component GANGLIA_MONITOR on datanode.ec2.internal
22:53:19,874 WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component DATANODE on datanode.ec2.internal
22:53:19,874 WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component NODEMANAGER on datanode.ec2.internal
22:53:19,890 WARN [Thread-1] HeartbeatMonitor:150 - Heartbeat lost from host datanode.ec2.internal
22:53:19,890 WARN [Thread-1] HeartbeatMonitor:165 - Setting component state to UNKNOWN for component GANGLIA_MONITOR on datanode.ec2.internal
Created 01-24-2017 09:55 AM
3. The other components on the agent host are running without any issues; the only issue is with the DataNode, which goes down after a few seconds.
4. On running the 'top' command, I can see that enough memory is available on the agent host.
Created 01-24-2017 09:40 AM
Hi Jay,
Thanks for the reply.
I got an error in output-30684.txt:
2017-01-24 03:39:17,877 - File['/etc/hadoop/conf/slaves'] {'content': Template('slaves.j2'), 'owner': 'hdfs'}
2017-01-24 03:39:17,877 - Directory['/var/lib/hadoop-hdfs'] {'owner': 'hdfs', 'group': 'hadoop', 'mode': 0751, 'recursive': True}
2017-01-24 03:39:17,893 - Host contains mounts: ['/', '/proc', '/sys', '/dev/pts', '/dev/shm', '/mnt/disk1', '/mnt/disk2', '/proc/sys/fs/binfmt_misc'].
2017-01-24 03:39:17,894 - Mount point for directory /mnt/disk1/hadoop/hdfs/data is /mnt/disk1
2017-01-24 03:39:17,894 - Mount point for directory /mnt/disk2/hadoop/hdfs/data is /mnt/disk2
2017-01-24 03:39:17,895 - Directory['/var/run/hadoop/hdfs'] {'owner': 'hdfs', 'recursive': True}
2017-01-24 03:39:17,895 - Directory['/var/log/hadoop/hdfs'] {'owner': 'hdfs', 'recursive': True}
2017-01-24 03:39:17,896 - File['/var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid'] {'action': ['delete'], 'not_if': 'ls /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid` >/dev/null 2>&1'}
2017-01-24 03:39:17,919 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid']
2017-01-24 03:39:17,919 - Execute['ulimit -c unlimited; su -s /bin/bash - hdfs -c 'export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode''] {'not_if': 'ls /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid >/dev/null 2>&1 && ps `cat /var/run/hadoop/hdfs/hadoop-hdfs-datanode.pid` >/dev/null 2>&1'}
Created 01-24-2017 09:44 AM
That was not an error; that was the output of the file.
I got nothing in error-30684.txt.
Output of command-30684.txt:
"namenode.ec2.internal" ], "hs_host": [ "namenode.ec2.internal" ], "hive_server_host": [ "namenode.ec2.internal" ] } }
Created 01-24-2017 09:51 AM
Based on the output of the "output-30684.txt" file, we can see that the DataNode start instruction has already been given to the ambari-agent, and the following is the command snippet:
/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode
- After that, the "hadoop-daemon.sh" script is responsible for starting the DataNode with the given arguments.
- Hence we should check the DataNode logs (.log and .out files) to find out what is going wrong.
- There might also be some OS resource constraints (low memory, low disk space, etc.). We can get that information using OS tools like "top" and "df -h", but looking at the DataNode .log / .out files will give a better idea here; see the sample commands below.
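A minimal sketch of those checks, assuming the log directory seen in the output above (/var/log/hadoop/hdfs); the exact log file names depend on the user and host name used by the daemon:

ls -lt /var/log/hadoop/hdfs/ | head -5              # find the newest DataNode .log/.out files
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.out
top -b -n 1 | head -20                              # quick look at memory usage
df -h                                               # disk space per mount point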
Created 01-24-2017 10:06 AM
Output of the DataNode log:
2017-01-24 04:59:13,837 INFO datanode.DataNode (DataNode.java:shutdown(1720)) - Shutdown complete.
2017-01-24 04:59:13,839 FATAL datanode.DataNode (DataNode.java:secureMain(2385)) - Exception in secureMain
java.io.IOException: the path component: '/var/lib/hadoop-hdfs' is owned by a user who is not root and not you. Your effective user id is 0; the path is owned by user id 508, and its permissions are 0751. Please fix this or select a different socket path.
    at org.apache.hadoop.net.unix.DomainSocket.validateSocketPathSecurity0(Native Method)
    at org.apache.hadoop.net.unix.DomainSocket.bindAndListen(DomainSocket.java:189)
    at org.apache.hadoop.hdfs.net.DomainPeerServer.<init>(DomainPeerServer.java:40)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.getDomainPeerServer(DataNode.java:892)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initDataXceiver(DataNode.java:858)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1056)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:415)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2268)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2402)
2017-01-24 04:59:13,841 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2017-01-24 04:59:13,843 INFO datanode.DataNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at datanode.ec2.internal/datanode
************************************************************/
Created 01-24-2017 10:11 AM
So the error in the log file is about permissions.
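A quick way to confirm the ownership that the error message complains about (assuming uid 508 is the hdfs user on this host, which is only a guess here):

ls -ld /var/lib/hadoop-hdfs     # shows the owner (uid 508) and the 0751 permissions mentioned in the error
id hdfs                         # confirm whether uid 508 is indeed hdfs

# Per the error text, each path component must be owned by root or by the user whose effective
# uid starts the DataNode (here 0, i.e. root), which is why the start aborts.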
Created 01-24-2017 10:08 AM
df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/xvda1       30G  9.9G    19G   36%  /
tmpfs            16G     0    16G    0%  /dev/shm
/dev/xvdf       1.1T  905G    75G   93%  /mnt/disk1
/dev/xvdg       1.1T  890G    90G   91%  /mnt/disk2