Created 09-27-2016 09:34 AM
How to repair an unhealthy NodeManager?
I restarted the YARN service, and now I have 4 NodeManagers started and 1 unhealthy. When I check
/var/log/hadoop/yarn I don't find any logs, so how do I repair the unhealthy NodeManager?
Created 09-27-2016 11:50 AM
I found the solution. Go to the property
yarn.nodemanager.disk-health-checker.min-healthy-disks
change its value to 0, and restart YARN. After that it works.
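For anyone wondering why this works: as far as I understand it, the NodeManager compares the fraction of its configured dirs that pass the disk check against this property and reports itself unhealthy when the fraction drops below it (the stock default is 0.25, as far as I know). A minimal Python sketch of that comparison follows. The function name and arguments are made up for illustration and are not part of any Hadoop API.

```python
def passes_min_healthy_disks(good_dirs: int, failed_dirs: int,
                             min_healthy_fraction: float) -> bool:
    """Return True if enough of the configured dirs are still healthy.

    Illustrative only: mirrors the idea behind
    yarn.nodemanager.disk-health-checker.min-healthy-disks.
    """
    total = good_dirs + failed_dirs
    if total == 0:
        # No dirs configured at all; this sketch treats that as failing
        # (an assumption, not verified against Hadoop's behaviour).
        return False
    return good_dirs / total >= min_healthy_fraction


# One configured dir, and it failed the disk check:
print(passes_min_healthy_disks(0, 1, 0.25))  # False
print(passes_min_healthy_disks(0, 1, 0.0))   # True
```

With the value at 0 the comparison can never fail, so the node is reported healthy even if a dir was marked bad. It is still worth finding out which dir failed and why.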
Created 09-27-2016 09:35 AM
@Mourad Chahri Can you check whether you have enough disk space available on the node?
Created 09-27-2016 09:39 AM
@Sandeep Nemuri Yes, I have enough space on disk.
Created 09-27-2016 09:38 AM
Could you please check in Ambari for the reason the node is unhealthy?
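If the reason is not obvious in the UI, the ResourceManager REST API also lists every node with its state and a health report string. Here is a rough Python sketch, assuming the RM web service is reachable over plain HTTP without authentication. The host name in the usage comment is a placeholder, and both function names are my own.

```python
import json
from urllib.request import Request, urlopen


def unhealthy_nodes(nodes_json: dict) -> list:
    """Pick (host, healthReport) pairs for nodes in the UNHEALTHY state."""
    nodes = (nodes_json.get("nodes") or {}).get("node") or []
    return [(n.get("nodeHostName"), n.get("healthReport"))
            for n in nodes if n.get("state") == "UNHEALTHY"]


def fetch_nodes(rm_url: str) -> dict:
    """GET <rm_url>/ws/v1/cluster/nodes and decode the JSON body."""
    req = Request(rm_url.rstrip("/") + "/ws/v1/cluster/nodes",
                  headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Usage (hypothetical ResourceManager address, replace with your own):
# for host, report in unhealthy_nodes(fetch_nodes("http://rm-host.example.com:8088")):
#     print(host, "->", report)
```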
Created 09-27-2016 09:43 AM
@Mourad Chahri Can you please restart only the unhealthy NodeManager and check whether it comes up correctly?
If it fails, please share the error message. You can find it in the Ambari start-service dialog window.
Please let me know if you have any questions about this. Happy to help.
Created 09-27-2016 09:48 AM
Yes, I can restart the unhealthy NodeManager. This is what I have in the log:
2016-09-27 09:44:32,687 - Group['hadoop'] {'ignore_failures': False}
2016-09-27 09:44:32,690 - Group['users'] {'ignore_failures': False}
2016-09-27 09:44:32,691 - User['hive'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,692 - User['mapred'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,693 - User['accumulo'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,694 - User['hbase'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,695 - User['ambari-qa'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['users']}
2016-09-27 09:44:32,696 - User['zookeeper'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,697 - User['tez'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['users']}
2016-09-27 09:44:32,698 - User['hdfs'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,699 - User['sqoop'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,700 - User['hcat'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,701 - User['yarn'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,702 - User['ams'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,703 - File['/var/lib/ambari-agent/data/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2016-09-27 09:44:32,734 - Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] {'not_if': '(test $(id -u ambari-qa) -gt 1000) || (false)'}
2016-09-27 09:44:32,741 - Skipping Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] due to not_if
2016-09-27 09:44:32,742 - Directory['/tmp/hbase-hbase'] {'owner': 'hbase', 'recursive': True, 'mode': 0775, 'cd_access': 'a'}
2016-09-27 09:44:32,757 - File['/var/lib/ambari-agent/data/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2016-09-27 09:44:32,759 - Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh hbase /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/tmp/hbase-hbase'] {'not_if': '(test $(id -u hbase) -gt 1000) || (false)'}
2016-09-27 09:44:32,766 - Skipping Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh hbase /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/tmp/hbase-hbase'] due to not_if
2016-09-27 09:44:32,767 - Group['hdfs'] {'ignore_failures': False}
2016-09-27 09:44:32,768 - User['hdfs'] {'ignore_failures': False, 'groups': ['hadoop', 'hdfs']}
2016-09-27 09:44:32,769 - Directory['/etc/hadoop'] {'mode': 0755}
2016-09-27 09:44:32,789 - File['/usr/hdp/current/hadoop-client/conf/hadoop-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2016-09-27 09:44:32,807 - Execute[('setenforce', '0')] {'not_if': '(! which getenforce ) || (which getenforce && getenforce | grep -q Disabled)', 'sudo': True, 'only_if': 'test -f /selinux/enforce'}
2016-09-27 09:44:32,857 - Directory['/var/log/hadoop'] {'owner': 'root', 'mode': 0775, 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,879 - Directory['/var/run/hadoop'] {'owner': 'root', 'group': 'root', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,880 - Directory['/tmp/hadoop-hdfs'] {'owner': 'hdfs', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,888 - File['/usr/hdp/current/hadoop-client/conf/commons-logging.properties'] {'content': Template('commons-logging.properties.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,891 - File['/usr/hdp/current/hadoop-client/conf/health_check'] {'content': Template('health_check.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,896 - File['/usr/hdp/current/hadoop-client/conf/log4j.properties'] {'content': ..., 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644}
2016-09-27 09:44:32,909 - File['/usr/hdp/current/hadoop-client/conf/hadoop-metrics2.properties'] {'content': Template('hadoop-metrics2.properties.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,919 - File['/usr/hdp/current/hadoop-client/conf/task-log4j.properties'] {'content': StaticFile('task-log4j.properties'), 'mode': 0755}
2016-09-27 09:44:32,921 - File['/usr/hdp/current/hadoop-client/conf/configuration.xsl'] {'owner': 'hdfs', 'group': 'hadoop'}
2016-09-27 09:44:32,929 - File['/etc/hadoop/conf/topology_mappings.data'] {'owner': 'hdfs', 'content': Template('topology_mappings.data.j2'), 'only_if': 'test -d /etc/hadoop/conf', 'group': 'hadoop'}
2016-09-27 09:44:32,941 - File['/etc/hadoop/conf/topology_script.py'] {'content': StaticFile('topology_script.py'), 'only_if': 'test -d /etc/hadoop/conf', 'mode': 0755}
2016-09-27 09:44:33,397 - Execute['export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop nodemanager'] {'user': 'yarn'}
2016-09-27 09:44:38,656 - Directory['/hadoop/yarn/local'] {'group': 'hadoop', 'recursive': True, 'cd_access': 'a', 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'}
2016-09-27 09:44:38,659 - Directory['/hadoop/yarn/log'] {'group': 'hadoop', 'recursive': True, 'cd_access': 'a', 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'}
2016-09-27 09:44:38,659 - Execute[('chown', '-R', 'yarn', '/hadoop/yarn/local/usercache/ambari-qa')] {'sudo': True, 'only_if': 'test -d /hadoop/yarn/local/usercache/ambari-qa'}
Created 09-27-2016 09:49 AM
2016-09-27 09:44:39,168 - File['/usr/hdp/current/hadoop-client/conf/mapred-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs'}
2016-09-27 09:44:39,172 - File['/usr/hdp/current/hadoop-client/conf/taskcontroller.cfg'] {'content': Template('taskcontroller.cfg.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:39,179 - XmlConfig['mapred-site.xml'] {'owner': 'mapred', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,191 - Generating config: /usr/hdp/current/hadoop-client/conf/mapred-site.xml
2016-09-27 09:44:39,192 - File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] {'owner': 'mapred', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,239 - Writing File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] because contents don't match
2016-09-27 09:44:39,239 - Changing owner for /usr/hdp/current/hadoop-client/conf/mapred-site.xml from 508 to mapred
2016-09-27 09:44:39,240 - XmlConfig['capacity-scheduler.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,253 - Generating config: /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml
2016-09-27 09:44:39,253 - File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,269 - Changing owner for /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml from 508 to hdfs
2016-09-27 09:44:39,269 - XmlConfig['ssl-client.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,282 - Generating config: /usr/hdp/current/hadoop-client/conf/ssl-client.xml
2016-09-27 09:44:39,282 - File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,290 - Writing File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml'] because contents don't match
2016-09-27 09:44:39,290 - Directory['/usr/hdp/current/hadoop-client/conf/secure'] {'owner': 'root', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:39,312 - XmlConfig['ssl-client.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf/secure', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,325 - Generating config: /usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml
2016-09-27 09:44:39,325 - File['/usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,340 - Writing File['/usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml'] because contents don't match
2016-09-27 09:44:39,341 - XmlConfig['ssl-server.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,354 - Generating config: /usr/hdp/current/hadoop-client/conf/ssl-server.xml
2016-09-27 09:44:39,354 - File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,363 - Writing File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml'] because contents don't match
2016-09-27 09:44:39,364 - File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml.example'] {'owner': 'mapred', 'group': 'hadoop'}
2016-09-27 09:44:39,364 - File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml.example'] {'owner': 'mapred', 'group': 'hadoop'}
2016-09-27 09:44:39,366 - File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action': ['delete'], 'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'}
2016-09-27 09:44:39,373 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf start nodemanager'] {'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1', 'user': 'yarn'}
2016-09-27 09:44:40,596 - Execute['ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'] {'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1', 'tries': 5, 'user': 'yarn', 'try_sleep': 1}
2016-09-27 09:44:40,798 - Skipping Execute['ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'] due to not_if
Created 09-27-2016 10:07 AM
2016-09-27 09:44:38,711 - Directory['/var/run/hadoop-yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,712 - Directory['/var/run/hadoop-yarn/yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,713 - Directory['/var/log/hadoop-yarn/yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,715 - Directory['/var/run/hadoop-mapreduce'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,717 - Directory['/var/run/hadoop-mapreduce/mapred'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,717 - Directory['/var/log/hadoop-mapreduce'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,718 - Directory['/var/log/hadoop-mapreduce/mapred'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,719 - Directory['/var/log/hadoop-yarn'] {'owner': 'yarn', 'ignore_failures': True, 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,720 - XmlConfig['core-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'hdfs', 'configurations': ...}
2016-09-27 09:44:38,752 - Generating config: /usr/hdp/current/hadoop-client/conf/core-site.xml
2016-09-27 09:44:38,752 - File['/usr/hdp/current/hadoop-client/conf/core-site.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,779 - Writing File['/usr/hdp/current/hadoop-client/conf/core-site.xml'] because contents don't match
2016-09-27 09:44:38,780 - XmlConfig['hdfs-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {'final': {'dfs.datanode.data.dir': 'true'}}, 'owner': 'hdfs', 'configurations': ...}
2016-09-27 09:44:38,793 - Generating config: /usr/hdp/current/hadoop-client/conf/hdfs-site.xml
2016-09-27 09:44:38,793 - File['/usr/hdp/current/hadoop-client/conf/hdfs-site.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,860 - Writing File['/usr/hdp/current/hadoop-client/conf/hdfs-site.xml'] because contents don't match
2016-09-27 09:44:38,861 - XmlConfig['mapred-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:38,874 - Generating config: /usr/hdp/current/hadoop-client/conf/mapred-site.xml
2016-09-27 09:44:38,874 - File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,923 - Writing File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] because contents don't match
2016-09-27 09:44:38,924 - Changing owner for /usr/hdp/current/hadoop-client/conf/mapred-site.xml from 501 to yarn
2016-09-27 09:44:38,924 - XmlConfig['yarn-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:38,937 - Generating config: /usr/hdp/current/hadoop-client/conf/yarn-site.xml
2016-09-27 09:44:38,937 - File['/usr/hdp/current/hadoop-client/conf/yarn-site.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,050 - Writing File['/usr/hdp/current/hadoop-client/conf/yarn-site.xml'] because contents don't match
2016-09-27 09:44:39,050 - XmlConfig['capacity-scheduler.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:39,063 - Generating config: /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml
2016-09-27 09:44:39,064 - File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,100 - Writing File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] because contents don't match
2016-09-27 09:44:39,101 - Changing owner for /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml from 506 to yarn
2016-09-27 09:44:39,101 - File['/etc/hadoop/conf/yarn.exclude'] {'owner': 'yarn', 'group': 'hadoop'}
2016-09-27 09:44:39,123 - File['/etc/security/limits.d/yarn.conf'] {'content': Template('yarn.conf.j2'), 'mode': 0644}
2016-09-27 09:44:39,127 - File['/etc/security/limits.d/mapreduce.conf'] {'content': Template('mapreduce.conf.j2'), 'mode': 0644}
2016-09-27 09:44:39,133 - File['/usr/hdp/current/hadoop-client/conf/yarn-env.sh'] {'content': InlineTemplate(...), 'owner': 'yarn', 'group': 'hadoop', 'mode': 0755}
2016-09-27 09:44:39,134 - Writing File['/usr/hdp/current/hadoop-client/conf/yarn-env.sh'] because contents don't match
2016-09-27 09:44:39,135 - File['/usr/hdp/current/hadoop-yarn-nodemanager/bin/container-executor'] {'group': 'hadoop', 'mode': 02050}
2016-09-27 09:44:39,143 - File['/usr/hdp/current/hadoop-client/conf/container-executor.cfg'] {'content': Template('container-executor.cfg.j2'), 'group': 'hadoop', 'mode': 0644}
2016-09-27 09:44:39,148 - Directory['/cgroups_test/cpu'] {'mode': 0755, 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
Created 09-27-2016 11:49 AM
@Mourad Chahri You can go to the ResourceManager UI. On the left side of the screen there is a Nodes link. If you click it, you should see all of your NodeManagers, and the reason a node is listed as unhealthy may be shown there. It is most likely related to the YARN local dirs or log dirs: you may be hitting the disk threshold for them. There are a few parameters you can check for this:
yarn.nodemanager.disk-health-checker.min-healthy-disks
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
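To see roughly how the last two thresholds apply to a given disk, here is a small Python sketch that checks a directory's filesystem against them. The defaults of 90.0 percent and 0 MB are what I believe the stock yarn-default.xml ships with, and the two directories in the usage comment are the ones Ambari created in the restart log earlier in this thread. Both are assumptions, so compare them with your own yarn-site.xml. The function names are made up for illustration.

```python
import shutil

# Assumed stock defaults; your yarn-site.xml may override them.
MAX_UTILIZATION_PCT = 90.0  # ...max-disk-utilization-per-disk-percentage
MIN_FREE_SPACE_MB = 0       # ...min-free-space-per-disk-mb


def disk_ok(total_bytes, free_bytes,
            max_util_pct=MAX_UTILIZATION_PCT, min_free_mb=MIN_FREE_SPACE_MB):
    """Return True if the disk is within both thresholds."""
    if total_bytes <= 0:
        return False
    used_pct = 100.0 - 100.0 * free_bytes / total_bytes
    free_mb = free_bytes / (1024 * 1024)
    return used_pct <= max_util_pct and free_mb >= min_free_mb


def check_dirs(dirs):
    """Print the usage of the filesystem behind each directory."""
    for d in dirs:
        try:
            usage = shutil.disk_usage(d)
        except OSError as exc:
            print(f"{d}: cannot read disk usage ({exc})")
            continue
        used_pct = 100.0 - 100.0 * usage.free / usage.total
        status = "ok" if disk_ok(usage.total, usage.free) else "over threshold"
        print(f"{d}: {used_pct:.1f}% used, {status}")


# Usage, run on the unhealthy node:
# check_dirs(["/hadoop/yarn/local", "/hadoop/yarn/log"])
```

This only approximates what the NodeManager does. It also checks that the dirs exist and are readable and writable by the yarn user, which this sketch does not cover.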
Finally, if that does not reveal the issue, you should look in /var/log/hadoop-yarn/yarn. Your previous comment shows you were looking in /var/log/hadoop/yarn which is not where the NodeManager log is located.
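If the log there is long, a quick way to pull out the interesting lines is something like the Python sketch below. The `*nodemanager*.log` file name pattern is a guess based on the pid file name in the restart output, so adjust it to whatever files you actually find in that directory. The function names are my own.

```python
import glob
import os
import re

# Log levels worth looking at first.
LEVEL_RE = re.compile(r"\b(WARN|ERROR|FATAL)\b")


def problem_lines(lines):
    """Return the lines that mention WARN, ERROR or FATAL."""
    return [ln.rstrip("\n") for ln in lines if LEVEL_RE.search(ln)]


def scan(log_dir="/var/log/hadoop-yarn/yarn", pattern="*nodemanager*.log", last=20):
    """Print the last few problem lines of each matching log file."""
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        with open(path, errors="replace") as fh:
            for ln in problem_lines(fh)[-last:]:
                print(f"{path}: {ln}")


# Usage, run on the unhealthy node:
# scan()
```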
I hope this helps.