Support Questions

Find answers, ask questions, and share your expertise

How to repair an unhealthy NodeManager?

Expert Contributor


I restarted the YARN service, but I have 4 NodeManagers started and 1 unhealthy. When I check /var/log/hadoop/yarn I don't find any log, so how can I repair the unhealthy NodeManager?

1 ACCEPTED SOLUTION

Expert Contributor

I found the solution: go to

yarn.nodemanager.disk-health-checker.min-healthy-disks

and change the value to 0, then restart YARN and it will work.
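For reference, here is a sketch of the corresponding yarn-site.xml entry (on HDP the usual route is Ambari's YARN Configs page rather than editing the file by hand). Note that a value of 0 effectively disables the minimum-healthy-disks check, so it hides the symptom; it is still worth addressing the underlying disk usage.

```xml
<!-- Sketch only: prefer setting this via Ambari (YARN > Configs) on HDP. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <!-- Minimum fraction of NodeManager local/log dirs that must be healthy
       for the node to be reported healthy; 0 disables the check. -->
  <value>0</value>
</property>
```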


13 REPLIES


@Mourad Chahri Can you check if you have enough disk space available on the node?

Expert Contributor

@Sandeep Nemuri Yes, I have enough space on disk.

@Mourad Chahri

Could you please check in Ambari for the reason the node is unhealthy?

Expert Contributor

@Sindhu

All I can see is:

1 NodeManager is unhealthy

Super Collaborator

@Mourad Chahri Can you please restart only the unhealthy NodeManager and check if it comes up correctly?

If it fails, please share the error message. You can find it in the Ambari start-service dialog window.

Please let me know if you have any questions regarding this. Happy to help.

Expert Contributor

Yes, I can restart the unhealthy NodeManager. I have this in the log:

2016-09-27 09:44:32,687 - Group['hadoop'] {'ignore_failures': False}
2016-09-27 09:44:32,690 - Group['users'] {'ignore_failures': False}
2016-09-27 09:44:32,691 - User['hive'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,692 - User['mapred'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,693 - User['accumulo'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,694 - User['hbase'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,695 - User['ambari-qa'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['users']}
2016-09-27 09:44:32,696 - User['zookeeper'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,697 - User['tez'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['users']}
2016-09-27 09:44:32,698 - User['hdfs'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,699 - User['sqoop'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,700 - User['hcat'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,701 - User['yarn'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,702 - User['ams'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,703 - File['/var/lib/ambari-agent/data/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2016-09-27 09:44:32,734 - Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] {'not_if': '(test $(id -u ambari-qa) -gt 1000) || (false)'}
2016-09-27 09:44:32,741 - Skipping Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] due to not_if
2016-09-27 09:44:32,742 - Directory['/tmp/hbase-hbase'] {'owner': 'hbase', 'recursive': True, 'mode': 0775, 'cd_access': 'a'}
2016-09-27 09:44:32,757 - File['/var/lib/ambari-agent/data/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2016-09-27 09:44:32,759 - Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh hbase /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/tmp/hbase-hbase'] {'not_if': '(test $(id -u hbase) -gt 1000) || (false)'}
2016-09-27 09:44:32,766 - Skipping Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh hbase /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/tmp/hbase-hbase'] due to not_if
2016-09-27 09:44:32,767 - Group['hdfs'] {'ignore_failures': False}
2016-09-27 09:44:32,768 - User['hdfs'] {'ignore_failures': False, 'groups': ['hadoop', 'hdfs']}
2016-09-27 09:44:32,769 - Directory['/etc/hadoop'] {'mode': 0755}
2016-09-27 09:44:32,789 - File['/usr/hdp/current/hadoop-client/conf/hadoop-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2016-09-27 09:44:32,807 - Execute[('setenforce', '0')] {'not_if': '(! which getenforce ) || (which getenforce && getenforce | grep -q Disabled)', 'sudo': True, 'only_if': 'test -f /selinux/enforce'}
2016-09-27 09:44:32,857 - Directory['/var/log/hadoop'] {'owner': 'root', 'mode': 0775, 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,879 - Directory['/var/run/hadoop'] {'owner': 'root', 'group': 'root', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,880 - Directory['/tmp/hadoop-hdfs'] {'owner': 'hdfs', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,888 - File['/usr/hdp/current/hadoop-client/conf/commons-logging.properties'] {'content': Template('commons-logging.properties.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,891 - File['/usr/hdp/current/hadoop-client/conf/health_check'] {'content': Template('health_check.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,896 - File['/usr/hdp/current/hadoop-client/conf/log4j.properties'] {'content': ..., 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644}
2016-09-27 09:44:32,909 - File['/usr/hdp/current/hadoop-client/conf/hadoop-metrics2.properties'] {'content': Template('hadoop-metrics2.properties.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,919 - File['/usr/hdp/current/hadoop-client/conf/task-log4j.properties'] {'content': StaticFile('task-log4j.properties'), 'mode': 0755}
2016-09-27 09:44:32,921 - File['/usr/hdp/current/hadoop-client/conf/configuration.xsl'] {'owner': 'hdfs', 'group': 'hadoop'}
2016-09-27 09:44:32,929 - File['/etc/hadoop/conf/topology_mappings.data'] {'owner': 'hdfs', 'content': Template('topology_mappings.data.j2'), 'only_if': 'test -d /etc/hadoop/conf', 'group': 'hadoop'}
2016-09-27 09:44:32,941 - File['/etc/hadoop/conf/topology_script.py'] {'content': StaticFile('topology_script.py'), 'only_if': 'test -d /etc/hadoop/conf', 'mode': 0755}
2016-09-27 09:44:33,397 - Execute['export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop nodemanager'] {'user': 'yarn'}
2016-09-27 09:44:38,656 - Directory['/hadoop/yarn/local'] {'group': 'hadoop', 'recursive': True, 'cd_access': 'a', 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'}
2016-09-27 09:44:38,659 - Directory['/hadoop/yarn/log'] {'group': 'hadoop', 'recursive': True, 'cd_access': 'a', 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'}
2016-09-27 09:44:38,659 - Execute[('chown', '-R', 'yarn', '/hadoop/yarn/local/usercache/ambari-qa')] {'sudo': True, 'only_if': 'test -d /hadoop/yarn/local/usercache/ambari-qa'}

Expert Contributor
2016-09-27 09:44:39,168 - File['/usr/hdp/current/hadoop-client/conf/mapred-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs'}
2016-09-27 09:44:39,172 - File['/usr/hdp/current/hadoop-client/conf/taskcontroller.cfg'] {'content': Template('taskcontroller.cfg.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:39,179 - XmlConfig['mapred-site.xml'] {'owner': 'mapred', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,191 - Generating config: /usr/hdp/current/hadoop-client/conf/mapred-site.xml
2016-09-27 09:44:39,192 - File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] {'owner': 'mapred', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,239 - Writing File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] because contents don't match
2016-09-27 09:44:39,239 - Changing owner for /usr/hdp/current/hadoop-client/conf/mapred-site.xml from 508 to mapred
2016-09-27 09:44:39,240 - XmlConfig['capacity-scheduler.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,253 - Generating config: /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml
2016-09-27 09:44:39,253 - File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,269 - Changing owner for /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml from 508 to hdfs
2016-09-27 09:44:39,269 - XmlConfig['ssl-client.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,282 - Generating config: /usr/hdp/current/hadoop-client/conf/ssl-client.xml
2016-09-27 09:44:39,282 - File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,290 - Writing File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml'] because contents don't match
2016-09-27 09:44:39,290 - Directory['/usr/hdp/current/hadoop-client/conf/secure'] {'owner': 'root', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:39,312 - XmlConfig['ssl-client.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf/secure', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,325 - Generating config: /usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml
2016-09-27 09:44:39,325 - File['/usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,340 - Writing File['/usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml'] because contents don't match
2016-09-27 09:44:39,341 - XmlConfig['ssl-server.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,354 - Generating config: /usr/hdp/current/hadoop-client/conf/ssl-server.xml
2016-09-27 09:44:39,354 - File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,363 - Writing File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml'] because contents don't match
2016-09-27 09:44:39,364 - File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml.example'] {'owner': 'mapred', 'group': 'hadoop'}
2016-09-27 09:44:39,364 - File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml.example'] {'owner': 'mapred', 'group': 'hadoop'}
2016-09-27 09:44:39,366 - File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action': ['delete'], 'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'}
2016-09-27 09:44:39,373 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf start nodemanager'] {'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1', 'user': 'yarn'}
2016-09-27 09:44:40,596 - Execute['ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'] {'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1', 'tries': 5, 'user': 'yarn', 'try_sleep': 1}
2016-09-27 09:44:40,798 - Skipping Execute['ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'] due to not_if

Expert Contributor
2016-09-27 09:44:38,711 - Directory['/var/run/hadoop-yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,712 - Directory['/var/run/hadoop-yarn/yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,713 - Directory['/var/log/hadoop-yarn/yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,715 - Directory['/var/run/hadoop-mapreduce'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,717 - Directory['/var/run/hadoop-mapreduce/mapred'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,717 - Directory['/var/log/hadoop-mapreduce'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,718 - Directory['/var/log/hadoop-mapreduce/mapred'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,719 - Directory['/var/log/hadoop-yarn'] {'owner': 'yarn', 'ignore_failures': True, 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,720 - XmlConfig['core-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'hdfs', 'configurations': ...}
2016-09-27 09:44:38,752 - Generating config: /usr/hdp/current/hadoop-client/conf/core-site.xml
2016-09-27 09:44:38,752 - File['/usr/hdp/current/hadoop-client/conf/core-site.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,779 - Writing File['/usr/hdp/current/hadoop-client/conf/core-site.xml'] because contents don't match
2016-09-27 09:44:38,780 - XmlConfig['hdfs-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {'final': {'dfs.datanode.data.dir': 'true'}}, 'owner': 'hdfs', 'configurations': ...}
2016-09-27 09:44:38,793 - Generating config: /usr/hdp/current/hadoop-client/conf/hdfs-site.xml
2016-09-27 09:44:38,793 - File['/usr/hdp/current/hadoop-client/conf/hdfs-site.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,860 - Writing File['/usr/hdp/current/hadoop-client/conf/hdfs-site.xml'] because contents don't match
2016-09-27 09:44:38,861 - XmlConfig['mapred-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:38,874 - Generating config: /usr/hdp/current/hadoop-client/conf/mapred-site.xml
2016-09-27 09:44:38,874 - File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,923 - Writing File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] because contents don't match
2016-09-27 09:44:38,924 - Changing owner for /usr/hdp/current/hadoop-client/conf/mapred-site.xml from 501 to yarn
2016-09-27 09:44:38,924 - XmlConfig['yarn-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:38,937 - Generating config: /usr/hdp/current/hadoop-client/conf/yarn-site.xml
2016-09-27 09:44:38,937 - File['/usr/hdp/current/hadoop-client/conf/yarn-site.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,050 - Writing File['/usr/hdp/current/hadoop-client/conf/yarn-site.xml'] because contents don't match
2016-09-27 09:44:39,050 - XmlConfig['capacity-scheduler.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:39,063 - Generating config: /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml
2016-09-27 09:44:39,064 - File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,100 - Writing File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] because contents don't match
2016-09-27 09:44:39,101 - Changing owner for /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml from 506 to yarn
2016-09-27 09:44:39,101 - File['/etc/hadoop/conf/yarn.exclude'] {'owner': 'yarn', 'group': 'hadoop'}
2016-09-27 09:44:39,123 - File['/etc/security/limits.d/yarn.conf'] {'content': Template('yarn.conf.j2'), 'mode': 0644}
2016-09-27 09:44:39,127 - File['/etc/security/limits.d/mapreduce.conf'] {'content': Template('mapreduce.conf.j2'), 'mode': 0644}
2016-09-27 09:44:39,133 - File['/usr/hdp/current/hadoop-client/conf/yarn-env.sh'] {'content': InlineTemplate(...), 'owner': 'yarn', 'group': 'hadoop', 'mode': 0755}
2016-09-27 09:44:39,134 - Writing File['/usr/hdp/current/hadoop-client/conf/yarn-env.sh'] because contents don't match
2016-09-27 09:44:39,135 - File['/usr/hdp/current/hadoop-yarn-nodemanager/bin/container-executor'] {'group': 'hadoop', 'mode': 02050}
2016-09-27 09:44:39,143 - File['/usr/hdp/current/hadoop-client/conf/container-executor.cfg'] {'content': Template('container-executor.cfg.j2'), 'group': 'hadoop', 'mode': 0644}
2016-09-27 09:44:39,148 - Directory['/cgroups_test/cpu'] {'mode': 0755, 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}

Expert Contributor

@Mourad Chahri You can go to the ResourceManager UI. From there you should see a Nodes link on the left side of the screen. If you click on that, you should see all of your NodeManagers, and the reason a node is listed as unhealthy may be shown there. It is most likely due to the YARN local dirs or log dirs: you may be hitting the disk threshold for these. There are a few parameters you can check:

yarn.nodemanager.disk-health-checker.min-healthy-disks

yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage

yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
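To see how these three parameters interact, here is a small Python sketch of the threshold logic (illustrative only, not the actual Hadoop implementation). The defaults assumed here are the documented ones: 90.0 for max-disk-utilization-per-disk-percentage, 0 MB for min-free-space-per-disk-mb, and 0.25 for min-healthy-disks.

```python
import shutil

def dir_is_healthy(used_pct, free_mb, max_util_pct=90.0, min_free_mb=0):
    # A local/log dir is "bad" when it exceeds the utilization ceiling
    # or drops below the free-space floor.
    return used_pct <= max_util_pct and free_mb >= min_free_mb

def node_is_healthy(dir_states, min_healthy_fraction=0.25):
    # dir_states: list of (used_pct, free_mb), one per yarn local/log dir.
    # The node is unhealthy when the fraction of good dirs falls below
    # min-healthy-disks.
    if not dir_states:
        return False
    good = sum(1 for used, free in dir_states if dir_is_healthy(used, free))
    return good / len(dir_states) >= min_healthy_fraction

def check_local_dir(path, max_util_pct=90.0, min_free_mb=0):
    # Inspect a real directory (e.g. /hadoop/yarn/local) the way the
    # health checker would.
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    free_mb = usage.free / (1024 * 1024)
    return dir_is_healthy(used_pct, free_mb, max_util_pct, min_free_mb)
```

With the defaults, a single dir at 95% utilization marks that dir bad, but the node only flips to unhealthy once fewer than a quarter of its dirs remain good; this is why setting min-healthy-disks to 0 masks the problem rather than fixing the full disk.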

Finally, if that does not reveal the issue, you should look in /var/log/hadoop-yarn/yarn. Your previous comment shows you were looking in /var/log/hadoop/yarn, which is not where the NodeManager log is located.

I hope this helps.