Support Questions

chahrimourad · ‎09-27-2016

How to repair an unhealthy NodeManager?

I restarted the YARN service, and now I have 4 NodeManagers started and 1 unhealthy. When I check /var/log/hadoop/yarn I can't find any log there, so how do I repair the unhealthy NodeManager?

chahrimourad · ‎09-27-2016

I found the solution. Go to

yarn.nodemanager.disk-health-checker.min-healthy-disks

change the value to 0, and restart YARN. After that it will work.


sandyy006 · ‎09-27-2016

@Mourad Chahri Can you check whether you have enough disk space available on the node?

chahrimourad · ‎09-27-2016

@Sandeep Nemuri Yes, I have enough disk space.

ssubhas · ‎09-27-2016

@Mourad Chahri

Could you please check in Ambari why the node is reported as unhealthy?

chahrimourad · ‎09-27-2016

@Sindhu

This is all I can see there:

1 NodeManager is unhealthy
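A count like this does not say why the node is unhealthy. The ResourceManager REST API exposes a per-node `state` and `healthReport` under `/ws/v1/cluster/nodes`, which usually carries the reason. The Python sketch below reads it; the function names are hypothetical, and the ResourceManager address you pass in is an assumption about your cluster (8088 is the usual default web port).

```python
import json
from urllib.request import urlopen


def fetch_nodes(rm_url: str) -> dict:
    """GET <rm_url>/ws/v1/cluster/nodes and return the parsed JSON body."""
    with urlopen(f"{rm_url}/ws/v1/cluster/nodes", timeout=10) as resp:
        return json.load(resp)


def unhealthy_nodes(payload: dict) -> list:
    """Return (host, healthReport) for every node whose state is UNHEALTHY."""
    # The node list sits under payload["nodes"]["node"]; tolerate missing keys.
    nodes = (payload.get("nodes") or {}).get("node") or []
    return [
        (node.get("nodeHostName"), node.get("healthReport", ""))
        for node in nodes
        if node.get("state") == "UNHEALTHY"
    ]
```

For example, `unhealthy_nodes(fetch_nodes("http://<rm-host>:8088"))` should list each unhealthy host together with its health report text.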

ssharma · ‎09-27-2016

@Mourad Chahri Can you please restart only the unhealthy NodeManager and check whether it's coming up correctly?

If it fails, please share the error message. You can find it in the Ambari start-service dialog window.

Please let me know if you have any questions regarding this. Happy to help.

chahrimourad · ‎09-27-2016

Yes, I can restart the unhealthy NodeManager. This is what I get in the log:

2016-09-27 09:44:32,687 - Group['hadoop'] {'ignore_failures': False}
2016-09-27 09:44:32,690 - Group['users'] {'ignore_failures': False}
2016-09-27 09:44:32,691 - User['hive'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,692 - User['mapred'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,693 - User['accumulo'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,694 - User['hbase'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,695 - User['ambari-qa'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['users']}
2016-09-27 09:44:32,696 - User['zookeeper'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,697 - User['tez'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['users']}
2016-09-27 09:44:32,698 - User['hdfs'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,699 - User['sqoop'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,700 - User['hcat'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,701 - User['yarn'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,702 - User['ams'] {'gid': 'hadoop', 'ignore_failures': False, 'groups': ['hadoop']}
2016-09-27 09:44:32,703 - File['/var/lib/ambari-agent/data/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2016-09-27 09:44:32,734 - Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] {'not_if': '(test $(id -u ambari-qa) -gt 1000) || (false)'}
2016-09-27 09:44:32,741 - Skipping Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh ambari-qa /tmp/hadoop-ambari-qa,/tmp/hsperfdata_ambari-qa,/home/ambari-qa,/tmp/ambari-qa,/tmp/sqoop-ambari-qa'] due to not_if
2016-09-27 09:44:32,742 - Directory['/tmp/hbase-hbase'] {'owner': 'hbase', 'recursive': True, 'mode': 0775, 'cd_access': 'a'}
2016-09-27 09:44:32,757 - File['/var/lib/ambari-agent/data/tmp/changeUid.sh'] {'content': StaticFile('changeToSecureUid.sh'), 'mode': 0555}
2016-09-27 09:44:32,759 - Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh hbase /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/tmp/hbase-hbase'] {'not_if': '(test $(id -u hbase) -gt 1000) || (false)'}
2016-09-27 09:44:32,766 - Skipping Execute['/var/lib/ambari-agent/data/tmp/changeUid.sh hbase /home/hbase,/tmp/hbase,/usr/bin/hbase,/var/log/hbase,/tmp/hbase-hbase'] due to not_if
2016-09-27 09:44:32,767 - Group['hdfs'] {'ignore_failures': False}
2016-09-27 09:44:32,768 - User['hdfs'] {'ignore_failures': False, 'groups': ['hadoop', 'hdfs']}
2016-09-27 09:44:32,769 - Directory['/etc/hadoop'] {'mode': 0755}
2016-09-27 09:44:32,789 - File['/usr/hdp/current/hadoop-client/conf/hadoop-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs', 'group': 'hadoop'}
2016-09-27 09:44:32,807 - Execute[('setenforce', '0')] {'not_if': '(! which getenforce ) || (which getenforce && getenforce | grep -q Disabled)', 'sudo': True, 'only_if': 'test -f /selinux/enforce'}
2016-09-27 09:44:32,857 - Directory['/var/log/hadoop'] {'owner': 'root', 'mode': 0775, 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,879 - Directory['/var/run/hadoop'] {'owner': 'root', 'group': 'root', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,880 - Directory['/tmp/hadoop-hdfs'] {'owner': 'hdfs', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:32,888 - File['/usr/hdp/current/hadoop-client/conf/commons-logging.properties'] {'content': Template('commons-logging.properties.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,891 - File['/usr/hdp/current/hadoop-client/conf/health_check'] {'content': Template('health_check.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,896 - File['/usr/hdp/current/hadoop-client/conf/log4j.properties'] {'content': ..., 'owner': 'hdfs', 'group': 'hadoop', 'mode': 0644}
2016-09-27 09:44:32,909 - File['/usr/hdp/current/hadoop-client/conf/hadoop-metrics2.properties'] {'content': Template('hadoop-metrics2.properties.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:32,919 - File['/usr/hdp/current/hadoop-client/conf/task-log4j.properties'] {'content': StaticFile('task-log4j.properties'), 'mode': 0755}
2016-09-27 09:44:32,921 - File['/usr/hdp/current/hadoop-client/conf/configuration.xsl'] {'owner': 'hdfs', 'group': 'hadoop'}
2016-09-27 09:44:32,929 - File['/etc/hadoop/conf/topology_mappings.data'] {'owner': 'hdfs', 'content': Template('topology_mappings.data.j2'), 'only_if': 'test -d /etc/hadoop/conf', 'group': 'hadoop'}
2016-09-27 09:44:32,941 - File['/etc/hadoop/conf/topology_script.py'] {'content': StaticFile('topology_script.py'), 'only_if': 'test -d /etc/hadoop/conf', 'mode': 0755}
2016-09-27 09:44:33,397 - Execute['export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf stop nodemanager'] {'user': 'yarn'}
2016-09-27 09:44:38,656 - Directory['/hadoop/yarn/local'] {'group': 'hadoop', 'recursive': True, 'cd_access': 'a', 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'}
2016-09-27 09:44:38,659 - Directory['/hadoop/yarn/log'] {'group': 'hadoop', 'recursive': True, 'cd_access': 'a', 'ignore_failures': True, 'mode': 0775, 'owner': 'yarn'}
2016-09-27 09:44:38,659 - Execute[('chown', '-R', 'yarn', '/hadoop/yarn/local/usercache/ambari-qa')] {'sudo': True, 'only_if': 'test -d /hadoop/yarn/local/usercache/ambari-qa'}

chahrimourad · ‎09-27-2016

2016-09-27 09:44:39,168 - File['/usr/hdp/current/hadoop-client/conf/mapred-env.sh'] {'content': InlineTemplate(...), 'owner': 'hdfs'}
2016-09-27 09:44:39,172 - File['/usr/hdp/current/hadoop-client/conf/taskcontroller.cfg'] {'content': Template('taskcontroller.cfg.j2'), 'owner': 'hdfs'}
2016-09-27 09:44:39,179 - XmlConfig['mapred-site.xml'] {'owner': 'mapred', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,191 - Generating config: /usr/hdp/current/hadoop-client/conf/mapred-site.xml
2016-09-27 09:44:39,192 - File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] {'owner': 'mapred', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,239 - Writing File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] because contents don't match
2016-09-27 09:44:39,239 - Changing owner for /usr/hdp/current/hadoop-client/conf/mapred-site.xml from 508 to mapred
2016-09-27 09:44:39,240 - XmlConfig['capacity-scheduler.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,253 - Generating config: /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml
2016-09-27 09:44:39,253 - File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,269 - Changing owner for /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml from 508 to hdfs
2016-09-27 09:44:39,269 - XmlConfig['ssl-client.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,282 - Generating config: /usr/hdp/current/hadoop-client/conf/ssl-client.xml
2016-09-27 09:44:39,282 - File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,290 - Writing File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml'] because contents don't match
2016-09-27 09:44:39,290 - Directory['/usr/hdp/current/hadoop-client/conf/secure'] {'owner': 'root', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:39,312 - XmlConfig['ssl-client.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf/secure', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,325 - Generating config: /usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml
2016-09-27 09:44:39,325 - File['/usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,340 - Writing File['/usr/hdp/current/hadoop-client/conf/secure/ssl-client.xml'] because contents don't match
2016-09-27 09:44:39,341 - XmlConfig['ssl-server.xml'] {'owner': 'hdfs', 'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'configuration_attributes': {}, 'configurations': ...}
2016-09-27 09:44:39,354 - Generating config: /usr/hdp/current/hadoop-client/conf/ssl-server.xml
2016-09-27 09:44:39,354 - File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': None, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,363 - Writing File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml'] because contents don't match
2016-09-27 09:44:39,364 - File['/usr/hdp/current/hadoop-client/conf/ssl-client.xml.example'] {'owner': 'mapred', 'group': 'hadoop'}
2016-09-27 09:44:39,364 - File['/usr/hdp/current/hadoop-client/conf/ssl-server.xml.example'] {'owner': 'mapred', 'group': 'hadoop'}
2016-09-27 09:44:39,366 - File['/var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid'] {'action': ['delete'], 'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'}
2016-09-27 09:44:39,373 - Execute['ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf start nodemanager'] {'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1', 'user': 'yarn'}
2016-09-27 09:44:40,596 - Execute['ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'] {'not_if': 'ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1', 'tries': 5, 'user': 'yarn', 'try_sleep': 1}
2016-09-27 09:44:40,798 - Skipping Execute['ls /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid >/dev/null 2>&1 && ps -p `cat /var/run/hadoop-yarn/yarn/yarn-yarn-nodemanager.pid` >/dev/null 2>&1'] due to not_if

chahrimourad · ‎09-27-2016

2016-09-27 09:44:38,711 - Directory['/var/run/hadoop-yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,712 - Directory['/var/run/hadoop-yarn/yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,713 - Directory['/var/log/hadoop-yarn/yarn'] {'owner': 'yarn', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,715 - Directory['/var/run/hadoop-mapreduce'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,717 - Directory['/var/run/hadoop-mapreduce/mapred'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,717 - Directory['/var/log/hadoop-mapreduce'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,718 - Directory['/var/log/hadoop-mapreduce/mapred'] {'owner': 'mapred', 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,719 - Directory['/var/log/hadoop-yarn'] {'owner': 'yarn', 'ignore_failures': True, 'recursive': True, 'cd_access': 'a'}
2016-09-27 09:44:38,720 - XmlConfig['core-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'hdfs', 'configurations': ...}
2016-09-27 09:44:38,752 - Generating config: /usr/hdp/current/hadoop-client/conf/core-site.xml
2016-09-27 09:44:38,752 - File['/usr/hdp/current/hadoop-client/conf/core-site.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,779 - Writing File['/usr/hdp/current/hadoop-client/conf/core-site.xml'] because contents don't match
2016-09-27 09:44:38,780 - XmlConfig['hdfs-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {'final': {'dfs.datanode.data.dir': 'true'}}, 'owner': 'hdfs', 'configurations': ...}
2016-09-27 09:44:38,793 - Generating config: /usr/hdp/current/hadoop-client/conf/hdfs-site.xml
2016-09-27 09:44:38,793 - File['/usr/hdp/current/hadoop-client/conf/hdfs-site.xml'] {'owner': 'hdfs', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,860 - Writing File['/usr/hdp/current/hadoop-client/conf/hdfs-site.xml'] because contents don't match
2016-09-27 09:44:38,861 - XmlConfig['mapred-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:38,874 - Generating config: /usr/hdp/current/hadoop-client/conf/mapred-site.xml
2016-09-27 09:44:38,874 - File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:38,923 - Writing File['/usr/hdp/current/hadoop-client/conf/mapred-site.xml'] because contents don't match
2016-09-27 09:44:38,924 - Changing owner for /usr/hdp/current/hadoop-client/conf/mapred-site.xml from 501 to yarn
2016-09-27 09:44:38,924 - XmlConfig['yarn-site.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:38,937 - Generating config: /usr/hdp/current/hadoop-client/conf/yarn-site.xml
2016-09-27 09:44:38,937 - File['/usr/hdp/current/hadoop-client/conf/yarn-site.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,050 - Writing File['/usr/hdp/current/hadoop-client/conf/yarn-site.xml'] because contents don't match
2016-09-27 09:44:39,050 - XmlConfig['capacity-scheduler.xml'] {'group': 'hadoop', 'conf_dir': '/usr/hdp/current/hadoop-client/conf', 'mode': 0644, 'configuration_attributes': {}, 'owner': 'yarn', 'configurations': ...}
2016-09-27 09:44:39,063 - Generating config: /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml
2016-09-27 09:44:39,064 - File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] {'owner': 'yarn', 'content': InlineTemplate(...), 'group': 'hadoop', 'mode': 0644, 'encoding': 'UTF-8'}
2016-09-27 09:44:39,100 - Writing File['/usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml'] because contents don't match
2016-09-27 09:44:39,101 - Changing owner for /usr/hdp/current/hadoop-client/conf/capacity-scheduler.xml from 506 to yarn
2016-09-27 09:44:39,101 - File['/etc/hadoop/conf/yarn.exclude'] {'owner': 'yarn', 'group': 'hadoop'}
2016-09-27 09:44:39,123 - File['/etc/security/limits.d/yarn.conf'] {'content': Template('yarn.conf.j2'), 'mode': 0644}
2016-09-27 09:44:39,127 - File['/etc/security/limits.d/mapreduce.conf'] {'content': Template('mapreduce.conf.j2'), 'mode': 0644}
2016-09-27 09:44:39,133 - File['/usr/hdp/current/hadoop-client/conf/yarn-env.sh'] {'content': InlineTemplate(...), 'owner': 'yarn', 'group': 'hadoop', 'mode': 0755}
2016-09-27 09:44:39,134 - Writing File['/usr/hdp/current/hadoop-client/conf/yarn-env.sh'] because contents don't match
2016-09-27 09:44:39,135 - File['/usr/hdp/current/hadoop-yarn-nodemanager/bin/container-executor'] {'group': 'hadoop', 'mode': 02050}
2016-09-27 09:44:39,143 - File['/usr/hdp/current/hadoop-client/conf/container-executor.cfg'] {'content': Template('container-executor.cfg.j2'), 'group': 'hadoop', 'mode': 0644}
2016-09-27 09:44:39,148 - Directory['/cgroups_test/cpu'] {'mode': 0755, 'group': 'hadoop', 'recursive': True, 'cd_access': 'a'}

iroberts · ‎09-27-2016

@Mourad Chahri You can go to the ResourceManager UI. There you should see a Nodes link on the left side of the screen. Click it to see all of your NodeManagers; the reason a node is listed as unhealthy may be shown there. It is most likely due to the YARN local dirs or log dirs: you may be hitting the disk threshold. There are a few parameters you can check for this:

yarn.nodemanager.disk-health-checker.min-healthy-disks

yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage

yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
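To see how the two per-disk thresholds apply to a given directory, here is a rough Python sketch. It is not the NodeManager's actual code: `dir_passes_disk_check` and `check_dirs` are made-up names, and the defaults (90.0 percent maximum utilization, 0 MB minimum free space) are what I believe stock Hadoop 2.x ships with, so compare them against your own yarn-site.xml.

```python
import shutil


def dir_passes_disk_check(total_bytes, free_bytes, max_util_pct=90.0, min_free_mb=0):
    """Return True if the disk is within both thresholds (illustrative logic)."""
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    free_mb = free_bytes / (1024 * 1024)
    # Fails when utilization exceeds the cutoff or free space drops below the minimum.
    return used_pct <= max_util_pct and free_mb >= min_free_mb


def check_dirs(paths, **thresholds):
    """Map each existing path to True/False based on its filesystem usage."""
    results = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple with total, used, free
        results[path] = dir_passes_disk_check(usage.total, usage.free, **thresholds)
    return results
```

Run on the affected node, `check_dirs(["/hadoop/yarn/local", "/hadoop/yarn/log"])` uses the two directories that appear in the Ambari output earlier in this thread; any entry that comes back False is a likely cause of the unhealthy state.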

Finally, if that does not reveal the issue, look in /var/log/hadoop-yarn/yarn. Your earlier comment shows you were looking in /var/log/hadoop/yarn, which is not where the NodeManager log is located.
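If the log in that directory is large, a small script can pull out just the warnings and errors. This is only a sketch: the `*nodemanager*.log` file name pattern and the assumption that log levels appear as space-separated words (WARN, ERROR, FATAL) are guesses about a typical setup, and both function names are hypothetical.

```python
import glob
import os


def newest_nodemanager_log(log_dir="/var/log/hadoop-yarn/yarn"):
    """Return the most recently modified *nodemanager*.log in log_dir, or None."""
    candidates = glob.glob(os.path.join(log_dir, "*nodemanager*.log"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def warn_and_error_lines(path, levels=("WARN", "ERROR", "FATAL")):
    """Yield lines that contain one of the given level words between spaces."""
    with open(path, errors="replace") as fh:
        for line in fh:
            if any(f" {level} " in line for level in levels):
                yield line.rstrip("\n")
```

Calling `warn_and_error_lines(newest_nodemanager_log())` on the unhealthy node should surface any disk-check messages without scrolling through the whole file.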

I hope this helps.