Created on 12-13-2016 06:49 PM - edited 09-16-2022 03:51 AM
# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated
While reconfiguring for high availability, 4 of 5 NodeManagers now won't start.
There is no stderr file.
I've tried removing the roles, then re-adding them.
Deleted all files in:
/var/lib/hadoop-yarn/yarn-nm-recovery/
/var/yarn/
Confirmed the owners matched the working node (a rough check is sketched below).
No luck so far.
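For reference, roughly how one could compare ownership of those directories against the working node (just an illustrative Python snippet, not a required step; the paths are the ones listed above):

# Illustrative check only: print owner and group of the cleared directories,
# so they can be compared with the one NodeManager host that still works.
import grp
import os
import pwd

for path in ("/var/lib/hadoop-yarn/yarn-nm-recovery/", "/var/yarn/"):
    st = os.stat(path)
    print(path, pwd.getpwuid(st.st_uid).pw_name, grp.getgrgid(st.st_gid).gr_name)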
Created 12-13-2016 08:22 PM
Hi,
Based on the error from the NM log, I can see the following:
Dec 13, 3:43:51.255 PM ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager RECEIVED SIGNAL 15: SIGTERM
Created 12-14-2016 09:57 AM
Thanks for the quick response.
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)
# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Cloudera Manager version:
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)
CDH Parcel version:
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated
# cat /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-hostname.log.out
http://pastebin.com/iu4hR03Q
==> we assume the SIGTERM is sent by the Cloudera Agent (perhaps giving up after waiting for some indication that the NM is running properly)
# cat /var/log/cloudera-scm-agent/cloudera-scm-agent.out
http://pastebin.com/8StbBsj4
==> there are errors in here ('ValueError: dictionary update sequence element #25 has length 1; 2 is required' and 'MainThread agent ERROR Failed to activate ')
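That ValueError looks like what Python raises when a dict is built from "key=value" style elements and one element does not split into two parts. A minimal sketch of how an empty line in an options/environment string could produce exactly this error (an assumed illustration of the failure mode, not the agent's actual code; the option names are placeholders):

# Hypothetical illustration only -- not the Cloudera Agent's real parsing code.
# Building a dict from "key=value" lines fails when a line is empty, because
# "".split("=", 1) yields [''] (length 1) instead of a (key, value) pair.
raw = "OPT_A=-Xmx1g\n\nOPT_B=-Dsome.flag=true"  # note the empty middle line

try:
    env = dict(line.split("=", 1) for line in raw.split("\n"))
except ValueError as e:
    print(e)  # dictionary update sequence element #1 has length 1; 2 is required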
There is no 'logs' directory within the process/*NODEMANAGER* (so no stderr to be found):
# find /var/run/cloudera-scm-agent/process | grep 'logs\|NODEMANAGER'
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stderr.log
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stdout.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stderr.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stdout.log
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-stack-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/container-executor.cfg
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/core-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/event-filter-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-metrics2.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-policy.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hdfs-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/http-auth-signature-secret
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/log4j.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/mapred-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/redaction-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-client.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-server.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.map
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.py
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn.keytab
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn-site.xml
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stderr.log
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stdout.log
So we are using Cloudera Manager.
When restarting the cluster, Yarn fails to start but all other services start OK.
When we drill into the Yarn 'instances' we find:
JobHistory Server running
ResourceManager (Active) running
ResourceManager (Standby) running
NodeManager (running)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)
with Status showing 4 "Down" and 4 "Good Health".
If we select one of the stopped NodeManager instances and attempt to start it manually, the pastebin logs above are what we see.
No log directory is created, and thus there is no stderr.
Cloudera Manager waits for it to start but eventually marks it as failed.
We are planning to deploy HA to production, and this is our test run on the QA lab system.
This failure is now blocking us from proceeding with our production HA deployment.
Frankly we don't even use Yarn (or MapReduce). At this point we only use HDFS and Impala.
Yarn seems to be a dependency for Hive and Impala.
If we are not using Yarn/MR and we can decommission these 4 failed NM instances, can the system run with a single HA pair of ResourceManagers and just one NodeManager instance?
(It would at least make Cloudera Manager happy, with green status and no failures on cluster restarts.)
Thanks.
Created 12-22-2016 04:56 PM
The issue you have is due to the empty line in the YARN_NODEMANAGER_OPTS variable. That empty line should not be there and is causing the split to fail.
Can you check what is in the "Java Configuration Options for NodeManager" setting in Cloudera Manager and make sure that there is no empty line in that config?
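If it helps, a quick way to spot a stray blank line in a value copied out of that text box (only an illustrative check; the option values below are placeholders):

# Illustrative check: paste the configured options between the triple quotes,
# then report any blank or whitespace-only lines (placeholder values shown).
opts = """-Xmx1024m

-Dcom.sun.management.jmxremote"""
for i, line in enumerate(opts.splitlines(), start=1):
    if not line.strip():
        print("blank line at position", i)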
Thanks,
Wilfred
Created 01-04-2017 02:40 PM
Thanks for your reply.
We have 5 nodes configured to run NodeManager; 1 works, but 4 fail.
If the "Java Configuration Options for NodeManager" were an empty string, then none should start, correct?
It's not empty.
If you have any other ideas, we would appreciate them.
Thanks.
Created 01-04-2017 02:58 PM
Found two values when searching for "Java Configuration Options for NodeManager".
Copied/pasted to make them the same
(we had added JMX parameters).
This seems to have fixed it.
Needs verification.
Created 01-04-2017 07:07 PM
Good to hear that you have found the discrepancy between the nodes and have been able to fix it. I reproduced your issue internally on Cloudera Manager 5.9.
I have logged an internal jira to fix CM so that an empty line added in the options is handled properly.
Wilfred
Created 01-06-2017 09:35 AM
The root cause seems to be that there are two 'textarea' boxes for the parameter 'Java Configuration Options for NodeManager', and if these do not contain the same value, then the NodeManager will not start.
These are the two boxes:
NODEMANAGER Imported From: TaskTracker (1)
NODEMANAGER Imported From: TaskTracker Default Group
Shouldn't Cloudera Manager disallow this condition, or protect the user from it happening in the first place?
Thanks.
(The JVM may very well be receiving an empty string for this parameter when these two do not match; just a guess.)
Created 01-08-2017 04:03 PM
You can have differences between the options for the NMs; that is not the problem. It could be that differences in the hardware used for the NMs require different JVM options to be set, so it is something that we allow and that will also work.
However, there cannot be an empty line in the options. An empty line in the options is passed on to the settings in the script that sets up the environment, and that is where it breaks: the empty line splits the setting into two in the script, which should not happen. The empty line(s) should be trimmed before we generate that settings script, which is the jira I filed.
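For illustration, the trimming described would amount to something like this (an assumed sketch, not the actual Cloudera Manager code):

# Assumed sketch of the proposed fix, not Cloudera Manager's real code:
# drop empty or whitespace-only lines from the configured options before the
# value is written into the generated settings script.
def trim_options(raw_opts):
    return "\n".join(line for line in raw_opts.splitlines() if line.strip())

print(trim_options("-Xmx1024m\n\n-Dcom.sun.management.jmxremote"))
# prints the two options on two lines, with the empty line removed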
Wilfred