
Unable to start YARN - Error starting NodeManagers

Contributor

# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

 

Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git:  d23c620f3a3bbd85d8511d6ebba49beaaab14b75)

 

CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

 

While reconfiguring for high availability, 4 of our 5 NodeManagers now won't start.

 

There is no stderr file.

 

http://pastebin.com/iu4hR03Q

http://pastebin.com/8StbBsj4

 

I've tried removing the roles, then re-adding the roles.

 

Deleted all files in:

/var/lib/hadoop-yarn/yarn-nm-recovery/

/var/yarn/

Confirmed the owners matched those on the working node.

No luck so far.
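For reference, the ownership check was done roughly like this (a sketch using GNU coreutils `stat`; run the same command on a failing node and the working node, then compare the output):

```shell
# Print path, owner:group, and mode for the NodeManager state dirs.
# Compare this output between a failing node and the working node.
for d in /var/lib/hadoop-yarn/yarn-nm-recovery /var/yarn; do
  stat -c '%n %U:%G %a' "$d"
done
```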

1 ACCEPTED SOLUTION

Contributor

Found two values for the search "java configuration options for nodemanager".

Copied/pasted to make them the same (we had added JMX parameters).

This seems to have fixed it.

Still needs verification.


8 REPLIES

Super Guru

Hi,


Based on the error in the NM log, I can see the following:

Dec 13, 3:43:51.255 PM  ERROR  org.apache.hadoop.yarn.server.nodemanager.NodeManager
RECEIVED SIGNAL 15: SIGTERM

This looks like the NM was killed by some other process, but I am not sure what. Do you use Cloudera Manager? If yes, can you check the logs under /var/log/cloudera-scm-agent and see if you can find any clues?
 
You can also check under /var/run/cloudera-scm-agent/process/*NODEMANAGER*, find the most recent one, and check its "logs" directory.
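For example, a quick way to find the most recently created NodeManager process directory (assuming the default agent path above):

```shell
# List NodeManager process dirs newest-first and pick the latest one;
# its logs/ subdirectory, if present, holds stderr.log and stdout.log.
latest=$(ls -td /var/run/cloudera-scm-agent/process/*NODEMANAGER* | head -1)
echo "$latest"
ls "$latest/logs" 2>/dev/null || echo "no logs directory"
```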
 
Regards
 

Contributor

Thanks for the quick response.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)

# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Cloudera Manager version:
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)

CDH Parcel version:
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

# cat /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-hostname.log.out
http://pastebin.com/iu4hR03Q

==> We assume the SIGTERM is caused by the Cloudera Agent (perhaps giving up after waiting for some indication that the NM is running properly).

# cat /var/log/cloudera-scm-agent/cloudera-scm-agent.out
http://pastebin.com/8StbBsj4

==> There are errors in here ('ValueError: dictionary update sequence element #25 has length 1; 2 is required' and 'MainThread agent ERROR Failed to activate').


There is no 'logs' directory within process/*NODEMANAGER* (so there is no stderr to be found):

# find /var/run/cloudera-scm-agent/process | grep 'logs\|NODEMANAGER'
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stderr.log
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stdout.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stderr.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stdout.log
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-stack-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/container-executor.cfg
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/core-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/event-filter-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-metrics2.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-policy.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hdfs-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/http-auth-signature-secret
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/log4j.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/mapred-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/redaction-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-client.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-server.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.map
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.py
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn.keytab
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn-site.xml
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stderr.log
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stdout.log

 

So we are using Cloudera Manager.

 

When restarting the cluster, YARN fails to start but all other services start OK.

 

When we drill into the YARN 'Instances' page we find:

JobHistory Server (running)

ResourceManager (Active) (running)

ResourceManager (Standby) (running)

NodeManager (running)

NodeManager (stopped)

NodeManager (stopped)

NodeManager (stopped)

NodeManager (stopped)

with Status showing 4 "Down" and 4 "Good Health".

 

If we select one of the stopped NodeManager instances and attempt to start it manually, the pastebin logs above are what we see.

There is no log directory created, and thus no stderr.

Cloudera Manager waits for it to start but eventually marks it as failed.

 

We are planning to deploy HA to production, and this is our test run on the QA lab system.

 

This failure is now blocking us from proceeding with our production HA deployment.

 

Frankly, we don't even use YARN (or MapReduce). At this point we only use HDFS and Impala.

 

YARN seems to be a dependency for Hive and Impala.

 

If we are not using YARN/MR and we decommission these 4 failed NM instances, can the system run with a single HA pair of ResourceManagers and just one NodeManager instance?

 

(It would at least make Cloudera Manager happy, with green status and no failures on cluster restarts.)

 

Thanks.

Super Collaborator

The issue you have is due to an empty line in the YARN_NODEMANAGER_OPTS variable. That empty line should not be there and is causing the split to fail.

Can you check what is in the "Java Configuration Options for NodeManager" in Cloudera Manager and make sure that there is no empty line in that config?

 

Thanks,

Wilfred

 

Contributor

Thanks for your reply.

We have 5 nodes configured to run NodeManager. 1 works, but 4 fail.

If the "Java Configuration Options for NodeManager" were an empty string, then none should start, correct?

It's not empty.

If you have any other ideas, we would appreciate them.

Thanks

Contributor

Found two values for the search "java configuration options for nodemanager".

Copied/pasted to make them the same (we had added JMX parameters).

This seems to have fixed it.

Still needs verification.

Super Collaborator

Good to hear that you found the discrepancy between the nodes and were able to fix it. I reproduced your issue internally on Cloudera Manager 5.9.

I have logged an internal jira to fix CM so that an empty line added to the options is handled properly.

 

Wilfred

Contributor

The root cause seems to be that there are two 'textarea' boxes for the parameter 'Java configuration options for NodeManager', and if they do not contain the same value, the NodeManager will not start.

 

these are the two boxes:

 

NODEMANAGER Imported From: TaskTracker (1)
NODEMANAGER Imported From: TaskTracker Default Group

 

 

Shouldn't Cloudera Manager disallow this condition, or protect the user from it happening in the first place?

 

Thanks.

 

(It may well be that the JVM receives an empty string for this parameter when the two boxes do not match; just a guess.)

Super Collaborator

You can have differences between the options for the NMs; that is not the problem. Hardware differences between the NMs may require different JVM options to be set, so that is something we allow and it will also work.

 

However, there cannot be an empty line in the options. The options are passed on to the script that sets the environment, and that is where it breaks: the empty line splits the setting in two, which should not happen. The empty line(s) should be trimmed before we generate that settings script, which is the jira I filed.
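The failure mode can be illustrated in plain shell (a sketch, not the actual CM-generated script): an embedded empty line leaves a blank entry in the middle of the options value, which breaks any line-based consumer, and trimming blank lines restores a single coherent value.

```shell
# A JVM options value with an accidental empty line in the middle,
# as it might come out of the CM text box.
OPTS='-Xmx1g

-Dcom.sun.management.jmxremote'

# Line-based consumers now see a blank entry inside the value.
printf '%s\n' "$OPTS" | grep -c '^$'   # -> 1 blank line

# The fix the jira describes: strip blank lines before generating the script.
CLEAN=$(printf '%s\n' "$OPTS" | sed '/^[[:space:]]*$/d')
printf '%s\n' "$CLEAN"
```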

 

Wilfred