Unable to start YARN - Error starting NodeManagers

Contributor

# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

 

Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)

CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

While reconfiguring for high availability, 4 of 5 NodeManagers now won't start.

There is no stderr file.

http://pastebin.com/iu4hR03Q

http://pastebin.com/8StbBsj4

I've tried removing the roles and then re-adding them.

I deleted all files in:

/var/lib/hadoop-yarn/yarn-nm-recovery/

/var/yarn/

I confirmed that the owners matched those on the working node.

No luck so far.


8 REPLIES

Super Guru

Hi,


Based on the error from the NM log, I can see the following:

Dec 13, 3:43:51.255 PM  ERROR   org.apache.hadoop.yarn.server.nodemanager.NodeManager
RECEIVED SIGNAL 15: SIGTERM

It looks like the NodeManager was killed by another process, but I'm not sure which one. Do you use Cloudera Manager? If so, can you check the logs under /var/log/cloudera-scm-agent and see if you can find any clues?

You can also look under /var/run/cloudera-scm-agent/process/*NODEMANAGER*, find the most recent directory, and check its "logs" directory for logs.

Regards
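
For anyone following along, the directory check suggested above can be scripted. This is a minimal sketch, assuming that modification time is a reasonable proxy for "most recent"; the function names are made up for illustration and are not part of Cloudera Manager.

```python
import glob
import os


def latest_process_dir(base="/var/run/cloudera-scm-agent/process",
                       pattern="*NODEMANAGER*"):
    """Return the most recently modified directory matching pattern, or None."""
    candidates = [p for p in glob.glob(os.path.join(base, pattern))
                  if os.path.isdir(p)]
    if not candidates:
        return None
    # Assumption: the newest mtime identifies the latest start attempt.
    return max(candidates, key=os.path.getmtime)


def has_logs_dir(process_dir):
    """Report whether the process directory contains a 'logs' subdirectory."""
    return os.path.isdir(os.path.join(process_dir, "logs"))


if __name__ == "__main__":
    newest = latest_process_dir()
    if newest is None:
        print("no NODEMANAGER process directory found")
    else:
        print(newest, "has logs:", has_logs_dir(newest))
```

If no "logs" directory is reported, the role most likely never got as far as launching the process, which points back at the agent log rather than the NodeManager log.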
 

Contributor

Thanks for the quick response.

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)

# uname -a
Linux hostname 2.6.32-642.6.2.el6.x86_64 #1 SMP Mon Oct 24 10:22:33 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Cloudera Manager version:
Version: Cloudera Express 5.8.2 (#17 built by jenkins on 20160916-1426 git: d23c620f3a3bbd85d8511d6ebba49beaaab14b75)

CDH Parcel version:
CDH 5 5.8.2-1.cdh5.8.2.p0.3 Distributed, Activated

# cat /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAGER-hostname.log.out
http://pastebin.com/iu4hR03Q

==> we assume the SIGTERM is caused by Cloudera Agent (perhaps giving up waiting on some indication NM is running properly)

# cat /var/log/cloudera-scm-agent/cloudera-scm-agent.out
http://pastebin.com/8StbBsj4

==> there are errors in here ('ValueError: dictionary update sequence element #25 has length 1; 2 is required' and 'MainThread agent ERROR Failed to activate ')
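
As a side note, the ValueError quoted above is the standard message Python raises when dict() is given a sequence in which one element is not a key/value pair. The sketch below reproduces the message with a naive KEY=VALUE parser. It is not the agent's actual code, and the variable values are invented for illustration.

```python
def parse_env(text):
    """Naively parse KEY=VALUE lines into a dict (illustration only)."""
    return dict(line.split("=", 1) for line in text.split("\n"))


# Hypothetical values, not taken from the cluster in this thread.
GOOD = "JAVA_HOME=/usr/java/default\nYARN_NODEMANAGER_OPTS=-Xmx1g"
BAD = "JAVA_HOME=/usr/java/default\nYARN_NODEMANAGER_OPTS=-Xmx1g\n\nFOO=bar"

if __name__ == "__main__":
    print(parse_env(GOOD))
    try:
        parse_env(BAD)
    except ValueError as err:
        # The blank line splits into a one-element list, which dict() rejects.
        print("ValueError:", err)
```

The element number in the message is the zero-based position of the offending item, so "#25" in the agent log would correspond to the 26th entry the agent tried to parse, assuming it builds its environment in a similar way.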


There is no 'logs' directory within the process/*NODEMANAGER* (so no stderr to be found):

# find /var/run/cloudera-scm-agent/process | grep 'logs\|NODEMANAGER'
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stderr.log
/var/run/cloudera-scm-agent/process/573-zookeeper-server/logs/stdout.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stderr.log
/var/run/cloudera-scm-agent/process/585-hdfs-DATANODE/logs/stdout.log
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/cloudera-stack-monitor.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/container-executor.cfg
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/core-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/event-filter-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-metrics2.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hadoop-policy.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/hdfs-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/http-auth-signature-secret
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/log4j.properties
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/mapred-site.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/redaction-rules.json
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-client.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/ssl-server.xml
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.map
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/topology.py
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn.keytab
/var/run/cloudera-scm-agent/process/593-yarn-NODEMANAGER/yarn-site.xml
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stderr.log
/var/run/cloudera-scm-agent/process/604-impala-IMPALAD/logs/stdout.log

 

So yes, we are using Cloudera Manager.

When restarting the cluster, YARN fails to start, but all other services start OK.

When we drill into the YARN 'Instances' we find:

JobHistory Server (running)
ResourceManager (Active) (running)
ResourceManager (Standby) (running)
NodeManager (running)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)
NodeManager (stopped)

with Status 4 "Down" and 4 "Good Health"

 

If we select one of the stopped NodeManager instances and attempt to start it manually, the pastebin logs above are what we see.

No log directory is created, and thus there is no stderr.

Cloudera Manager waits for it to start but eventually marks it as failed.

We are planning to deploy HA to production, and this is our test run on the QA lab system. This failure is now blocking us from proceeding with our production HA deployment.

Frankly, we don't even use YARN (or MapReduce). At this point we only use HDFS and Impala. YARN seems to be a dependency for Hive and Impala.

If we are not using YARN/MR and we can decommission these 4 failed NM instances, can the system run with a single HA pair of RMs and just one NM instance?

(It would at least make Cloudera Manager happy, with green status and no failures upon cluster restarts.)

Thanks.

Super Collaborator

The issue you have is due to an empty line in the YARN_NODEMANAGER_OPTS variable. That empty line should not be there, and it is causing the split to fail.

Can you check what is in the "Java Configuration Options for NodeManager" in Cloudera Manager and make sure that there is no empty line in that config?

 

Thanks,

Wilfred
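
To check an options value for blank lines outside the Cloudera Manager UI, something like the following could be used. It is only a sketch: the function name and the sample JMX flags are assumptions, and the value would have to be pasted in by hand.

```python
def blank_line_numbers(value):
    """Return the 1-based numbers of empty or whitespace-only lines."""
    return [number
            for number, line in enumerate(value.splitlines(), start=1)
            if not line.strip()]


# Hypothetical options text with a stray empty line in the middle.
SAMPLE = (
    "-Dcom.sun.management.jmxremote\n"
    "\n"
    "-Dcom.sun.management.jmxremote.ssl=false"
)

if __name__ == "__main__":
    print("blank lines at:", blank_line_numbers(SAMPLE))
```

Running the check against the value of each role group separately should show which group carries the stray line.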

 

Contributor

Thanks for your reply.

We have 5 nodes configured to run NodeManager. 1 works, but 4 fail.

If the "Java Configuration Options for NodeManager" value were an empty string, then none should start, correct?

It's not empty.

If you have other ideas, we would appreciate them.

Thanks

Contributor (accepted solution)

Found two values when searching for "java configuration options for nodemanager".

Copied and pasted to make them the same.

(We had added JMX parameters.)

This seems to have fixed it.

Needs verification.

Super Collaborator

Good to hear that you have found the discrepancy between the nodes and have been able to fix it. I reproduced your issue internally on Cloudera Manager 5.9.

I have logged an internal JIRA to fix CM so that it properly handles an empty line in the options.

 

Wilfred

Contributor

The root cause seems to be that there are two 'textarea' boxes for the parameter 'java configuration options for nodemanager', and if these do not contain the same value, then NodeManager will not start.

These are the two boxes:

NODEMANAGER Imported From: TaskTracker (1)
NODEMANAGER Imported From: TaskTracker Default Group

Shouldn't Cloudera Manager prevent this condition from existing, or protect the user from it happening in the first place?

Thanks.

(It may well be that the JVM receives an empty string for this parameter when the two do not match; just a guess.)

Super Collaborator

You can have differences between the options for the NMs; that is not the problem. Differences in the hardware used by the NMs may require different JVM options, so this is something that we allow and that will work.

However, there cannot be an empty line in the options. An empty line in the options is passed on to the script that sets the environment, and that is where it breaks. The empty line splits the setting into two in the script, which should not happen. The empty line(s) should be trimmed before we generate that settings script, which is the JIRA I filed.

 

Wilfred
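
The trimming described above could look roughly like this. It is a sketch of the idea only, not the fix that went into Cloudera Manager; the function name and sample flags are hypothetical.

```python
def strip_blank_lines(options):
    """Drop empty or whitespace-only lines and trim the remaining ones."""
    return "\n".join(line.strip()
                     for line in options.splitlines()
                     if line.strip())


# Hypothetical options text: a heap setting, a stray empty line, a JMX flag.
SAMPLE = "-Xmx1g\n\n-Dcom.sun.management.jmxremote\n"

if __name__ == "__main__":
    cleaned = strip_blank_lines(SAMPLE)
    print(repr(cleaned))
```

Until such a fix is in place, removing the empty line by hand in the affected role group has the same effect.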