
YARN Resource Manager Halts with java.lang.OutOfMemoryError: unable to create new native thread

Contributor

The YARN Resource Manager halts with an OOM ("unable to create new native thread"), and the job fails over to the standby Resource Manager, which completes the task.

 

Could you please let us know the root cause of this issue?

Error message:

 

2018-03-22 02:30:09,637 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2018-03-22 02:30:10,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ACQUIRED to RUNNING
2018-03-22 02:30:10,695 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: checking for deactivate...
2018-03-22 02:30:19,354 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:30,212 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:34,090 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[2101925946@qtp-1878992188-14302,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1095)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:723)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2018-03-22 02:30:34,093 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

 

 

yarn application -status application_1521451854044_2288

Application Report :
    Application-Id : application_1521451854044_2288
    Application-Name : oozie:launcher:T=shell:W=OS_Changes_incremental_workflow:A=shell-b8b2:ID=0006766-180222181315002-oozie-oozi-W
    Application-Type : MAPREDUCE
    User : edh_srv_prod
    Queue : root.edh_srv_prod
    Start-Time : 1521710999557
    Finish-Time : 1521711593154
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
    Tracking-URL : https://server1:19890/jobhistory/job/job_1521451854044_2288
    RPC Port : 40930
    AM Host : server3
    Aggregate Resource Allocation : 1809548 MB-seconds, 1181 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics : Attempt recovered after RM restart

 

5 REPLIES

Mentor
What CDH version are you using? If it is equal to or lower than 5.9.1 or 5.8.3, and you use a KMS service in the cluster (for HDFS Transparent Encryption Zone features), you may be hitting https://issues.apache.org/jira/browse/HADOOP-13838, which has been fixed in the bug-fix releases of CDH 5.8.4, 5.9.2, and 5.10.0 onwards.
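If you want to check whether a thread leak is building up before the next crash, here is a minimal diagnostic sketch (it assumes the ResourceManager can be located with the pgrep pattern below and that the JDK's jstack tool is available; adjust both to your environment):

# Locate the ResourceManager JVM (the pattern is an assumption; adjust as needed)
RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager' | head -n 1)

# Live thread count of the RM process (NLWP = number of lightweight processes)
ps -o nlwp= -p "$RM_PID"

# Group a thread dump by thread name to see which pool keeps growing
# (run as the same user that owns the RM process)
jstack "$RM_PID" | grep '^"' | cut -d'"' -f2 | sed 's/[0-9]*$//' | sort | uniq -c | sort -rn | head

If the thread count climbs steadily between samples, the grouped dump usually points at the pool or client that is leaking.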

Contributor

I'm currently on CDH 5.8.2 with the KMS service, but what are your thoughts on the OS running out of PIDs, as the error message seems to suggest?

 

 

Mentor
Thank you for confirming the CDH version. Do you also have a KMS service in the cluster? If yes, you're definitely hitting the aforementioned bug.

You're partially right about the "OS running out of PIDs". More specifically, the YARN RM process runs into its 'number of processes' (nproc) ulimit, which should be set to a high default (32k processes) if you are running Cloudera Manager. There's no reason YARN should normally be using threads counting up to 32k.
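For anyone who wants to verify this on a running cluster, a quick sketch (again assuming the RM can be found with the pgrep pattern below, which is an assumption) is to compare the RM's current thread count against the nproc limit the process actually got:

RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager' | head -n 1)

# Threads currently owned by the RM JVM
ps -o nlwp= -p "$RM_PID"

# Effective limits of the running process; "Max processes" is the nproc value
grep -i 'max processes' /proc/"$RM_PID"/limits

# For comparison, the nproc limit a fresh shell for the same user would get
ulimit -u

Note that on Linux the nproc limit counts threads and is enforced per user, so other processes running as the same user count toward the same ceiling.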

Contributor
Yes, we do have a KMS service in the cluster. Thanks for providing clarity on "OS running out of PID".

New Contributor

We are using CDH 5.14.0, and I found that our components (HDFS, YARN, HBase) would restart because of the same issue. The exception looks like this:

 

java.io.IOException: Cannot run program "stat": error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:551)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.fs.HardLink.getLinkCount(HardLink.java:218)
at org.apache.hadoop.hdfs.server.datanode.ReplicaInfo.breakHardLinksIfNeeded(ReplicaInfo.java:265)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.append(FsDatasetImpl.java:1177)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.append(FsDatasetImpl.java:1148)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:210)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:675)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 13 more

 

2018-06-20 02:05:54,797 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154)
at java.lang.Thread.run(Thread.java:748)

 

 

Also, I noted that Cloudera Manager helps us set the ulimits. Here is our config:

 

if [ $(id -u) -eq 0 ]; then
  # Max number of open files
  ulimit -n 32768

  # Max number of child processes and threads
  ulimit -u 65536

  # Max locked memory
  ulimit -l unlimited
fi

 

PS: Our machines have 72 cores and 250 GB of RAM. Could you help me understand what causes the failure to create native threads?
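Errors such as "error=11, Resource temporarily unavailable" from fork and "unable to create new native thread" generally mean the kernel refused to create another thread or process, usually because a per-user nproc limit or a system-wide cap was hit, rather than because the machine ran out of RAM. A rough way to check on an affected node (a sketch; the DataNode pgrep pattern is an assumption, adjust it to the daemon you are investigating):

# System-wide caps on processes/threads
sysctl kernel.pid_max kernel.threads-max

# Total threads currently running on the node
ps -eLf | wc -l

# Biggest thread consumers per process (NLWP = thread count)
ps -eo nlwp,pid,user,comm --sort=-nlwp | head

# Effective nproc limit of a specific daemon, e.g. the DataNode
DN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode' | head -n 1)
grep -i 'max processes' /proc/"$DN_PID"/limits

Also keep in mind that the ulimit commands in the snippet above only apply to processes started under that script; /proc/<pid>/limits shows what the running daemon actually received.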