
YARN Resource Manager Halts with java.lang.OutOfMemoryError: unable to create new native thread

Contributor

The YARN Resource Manager halts with an OOM ("unable to create new native thread"), and the job fails over to the standby Resource Manager, which completes the task.

 

Could you please let us know the root cause of this issue?

Error message:

 

2018-03-22 02:30:09,637 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2018-03-22 02:30:10,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ACQUIRED to RUNNING
2018-03-22 02:30:10,695 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: checking for deactivate...
2018-03-22 02:30:19,354 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:30,212 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:34,090 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[2101925946@qtp-1878992188-14302,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1095)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:723)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2018-03-22 02:30:34,093 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

 

 

yarn application -status application_1521451854044_2288

Application Report :
    Application-Id : application_1521451854044_2288
    Application-Name : oozie:launcher:T=shell:W=OS_Changes_incremental_workflow:A=shell-b8b2:ID=0006766-180222181315002-oozie-oozi-W
    Application-Type : MAPREDUCE
    User : edh_srv_prod
    Queue : root.edh_srv_prod
    Start-Time : 1521710999557
    Finish-Time : 1521711593154
    Progress : 100%
    State : FINISHED
    Final-State : SUCCEEDED
    Tracking-URL : https://server1:19890/jobhistory/job/job_1521451854044_2288
    RPC Port : 40930
    AM Host : server3
    Aggregate Resource Allocation : 1809548 MB-seconds, 1181 vcore-seconds
    Log Aggregation Status : SUCCEEDED
    Diagnostics : Attempt recovered after RM restart

 

5 REPLIES

Mentor
What CDH version are you using? If it is equal to or lower than 5.9.1 or 5.8.3, and you use a KMS service in the cluster (for HDFS Transparent Encryption Zone features), you may be hitting https://issues.apache.org/jira/browse/HADOOP-13838, which has been fixed in the bug-fix releases of CDH 5.8.4, 5.9.2, and 5.10.0 onwards.
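If you want to check whether a thread leak is building up before the next crash, here is a minimal diagnostic sketch (it assumes the ResourceManager can be located with the pgrep pattern below and that the JDK's jstack tool is available; adjust both to your environment):

# Locate the ResourceManager JVM (the pattern is an assumption; adjust as needed)
RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager' | head -n 1)

# Live thread count of the RM process (NLWP = number of lightweight processes)
ps -o nlwp= -p "$RM_PID"

# Group a thread dump by thread name to see which pool keeps growing
# (run as the same user that owns the RM process)
jstack "$RM_PID" | grep '^"' | cut -d'"' -f2 | sed 's/[0-9]*$//' | sort | uniq -c | sort -rn | head

If the thread count climbs steadily between samples, the grouped dump usually points at the pool or client that is leaking.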

Contributor

I'm currently on CDH 5.8.2 with the KMS service, but what are your thoughts on the OS running out of PIDs, as the error message seems to suggest?

 

 

Mentor
Thank you for confirming the CDH version. Do you also have a KMS service in the cluster? If yes, you're definitely hitting the aforementioned bug.

You're partially right about the "OS running out of PIDs". More specifically, the YARN RM process runs into its 'number of processes' (nproc) ulimit, which should be set to a high default (32k processes) if you are running Cloudera Manager. There's no reason YARN should normally be using threads counting up to 32k.
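For anyone who wants to verify this on a running cluster, a quick sketch (again assuming the RM can be found with the pgrep pattern below, which is an assumption) is to compare the RM's current thread count against the nproc limit the process actually got:

RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager' | head -n 1)

# Threads currently owned by the RM JVM
ps -o nlwp= -p "$RM_PID"

# Effective limits of the running process; "Max processes" is the nproc value
grep -i 'max processes' /proc/"$RM_PID"/limits

# For comparison, the nproc limit a fresh shell for the same user would get
ulimit -u

Note that on Linux the nproc limit counts threads and is enforced per user, so other processes running as the same user count toward the same ceiling.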

Contributor
Yes, we do have a KMS service in the cluster. Thanks for providing clarity on "OS running out of PID".

New Contributor

We are using CDH 5.14.0, and I found that our components (HDFS, YARN, HBase) would restart because of the same issue. The exception looks like this:

 

java.io.IOException: Cannot run program "stat": error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:551)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.fs.HardLink.getLinkCount(HardLink.java:218)
at org.apache.hadoop.hdfs.server.datanode.ReplicaInfo.breakHardLinksIfNeeded(ReplicaInfo.java:265)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.append(FsDatasetImpl.java:1177)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.append(FsDatasetImpl.java:1148)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:210)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:675)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 13 more

 

2018-06-20 02:05:54,797 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154)
at java.lang.Thread.run(Thread.java:748)

 

 

Also, I noted that Cloudera Manager helps us set the ulimits. Here is our config:

 

if [ $(id -u) -eq 0 ]; then
  # Max number of open files
  ulimit -n 32768

  # Max number of child processes and threads
  ulimit -u 65536

  # Max locked memory
  ulimit -l unlimited
fi

 

PS: Our machines have 72 cores and 250 GB of RAM. Could you help me understand what causes the failure to create native threads?
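Errors such as "error=11, Resource temporarily unavailable" from fork and "unable to create new native thread" generally mean the kernel refused to create another thread or process, usually because a per-user nproc limit or a system-wide cap was hit, rather than because the machine ran out of RAM. A rough way to check on an affected node (a sketch; the DataNode pgrep pattern is an assumption, adjust it to the daemon you are investigating):

# System-wide caps on processes/threads
sysctl kernel.pid_max kernel.threads-max

# Total threads currently running on the node
ps -eLf | wc -l

# Biggest thread consumers per process (NLWP = thread count)
ps -eo nlwp,pid,user,comm --sort=-nlwp | head

# Effective nproc limit of a specific daemon, e.g. the DataNode
DN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.datanode.DataNode' | head -n 1)
grep -i 'max processes' /proc/"$DN_PID"/limits

Also keep in mind that the ulimit commands in the snippet above only apply to processes started under that script; /proc/<pid>/limits shows what the running daemon actually received.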