Created on 03-22-2018 11:47 PM - edited 09-16-2022 06:00 AM
The YARN Resource Manager halts with an OOM ("unable to create new native thread") and the job fails over to the standby Resource Manager to complete the task.
Could you please let us know the root cause of this issue?
Error message:
2018-03-22 02:30:09,637 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2018-03-22 02:30:10,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ACQUIRED to RUNNING
2018-03-22 02:30:10,695 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: checking for deactivate...
2018-03-22 02:30:19,354 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:30,212 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:34,090 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[2101925946@qtp-1878992188-14302,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1095)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:723)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2018-03-22 02:30:34,093 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
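Despite the OutOfMemoryError name, "unable to create new native thread" usually means the OS refused to create another thread, not that the JVM heap is full. A minimal sketch of the checks I would run on the RM host (the `RM_PID` placeholder is an assumption, not from the logs above):

```shell
# "unable to create new native thread" is typically an OS limit, not heap.
# Per-user cap on processes *and* threads for the current user:
ulimit -u

# Thread count for one JVM, e.g. the ResourceManager
# (RM_PID is a placeholder; locate it with e.g. `pgrep -f ResourceManager`):
# ls /proc/$RM_PID/task | wc -l

# Same count for the current shell, just to show the mechanism:
ls /proc/self/task | wc -l
```

If the RM's thread count is close to `ulimit -u` when the FATAL line appears, the limit (or a thread leak) is the likely cause.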
yarn application -status application_1521451854044_2288
Application Report :
Application-Id : application_1521451854044_2288
Application-Name : oozie:launcher:T=shell:W=OS_Changes_incremental_workflow:A=shell-b8b2:ID=0006766-180222181315002-oozie-oozi-W
Application-Type : MAPREDUCE
User : edh_srv_prod
Queue : root.edh_srv_prod
Start-Time : 1521710999557
Finish-Time : 1521711593154
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED
Tracking-URL : https://server1:19890/jobhistory/job/job_1521451854044_2288
RPC Port : 40930
AM Host : server3
Aggregate Resource Allocation : 1809548 MB-seconds, 1181 vcore-seconds
Log Aggregation Status : SUCCEEDED
Diagnostics : Attempt recovered after RM restart
Created on 03-23-2018 11:25 PM - edited 03-23-2018 11:27 PM
I'm currently on CDH 5.8.2 with the KMS service, but what's your thought on the OS running out of PIDs, which the error message suggests is likely?
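That is plausible: on Linux every thread consumes a PID, so the system-wide ceilings matter as well as the per-user ulimit. A quick check (the values are whatever your kernel reports; nothing here is CDH-specific):

```shell
# Every Linux thread gets a PID, so these global ceilings can produce the
# same OOM even when `ulimit -u` looks generous:
cat /proc/sys/kernel/pid_max      # highest PID the kernel will hand out
cat /proc/sys/kernel/threads-max  # system-wide thread cap
```

On a busy cluster node it is worth comparing `ps -eLf | wc -l` (total threads) against these numbers at the time of the failure.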
Created 06-20-2018 12:30 AM
We are using CDH 5.14.0, and I found that our components (HDFS, YARN, HBase) restart because of the same issue. The exception looks like this:
java.io.IOException: Cannot run program "stat": error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:551)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.fs.HardLink.getLinkCount(HardLink.java:218)
at org.apache.hadoop.hdfs.server.datanode.ReplicaInfo.breakHardLinksIfNeeded(ReplicaInfo.java:265)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.append(FsDatasetImpl.java:1177)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.append(FsDatasetImpl.java:1148)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:210)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:675)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 13 more
2018-06-20 02:05:54,797 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154)
at java.lang.Thread.run(Thread.java:748)
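The two traces above point at the same root cause: `error=11` (EAGAIN) from `fork` and "unable to create new native thread" both typically fail against the same per-user task limit. A sketch for finding which process is consuming the threads, using only `/proc` (no extra tooling assumed):

```shell
# List the biggest thread consumers on the host; the top entries are the
# likely culprits for exhausting the per-user or system-wide task limit.
for pid in /proc/[0-9]*; do
  count=$(ls "$pid/task" 2>/dev/null | wc -l)          # threads in this process
  cmd=$(tr '\0' ' ' < "$pid/cmdline" 2>/dev/null | cut -c1-60)
  printf '%6d %s\n' "$count" "$cmd"
done | sort -rn | head -10
```

If one JVM dominates the list, inspect its thread dump (`jstack <pid>`) rather than raising limits blindly.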
Also, I noted that Cloudera Manager sets the ulimit for us. Here is our config:
if [ $(id -u) -eq 0 ]; then
# Max number of open files
ulimit -n 32768
# Max number of child processes and threads
ulimit -u 65536
# Max locked memory
ulimit -l unlimited
fi
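One caveat with that script: `ulimit` only affects processes started from the shell that sets it. To confirm what a running daemon actually inherited, read its limits from `/proc` (the `DN_PID` placeholder is an assumption, standing in for your DataNode's pid):

```shell
# ulimit applies only to children of the shell that set it; verify what a
# running daemon really has (DN_PID is a placeholder for the DataNode pid):
# grep -E 'Max (processes|open files)' /proc/$DN_PID/limits

# Same check against the current shell, to show the output format:
grep -E 'Max (processes|open files)' /proc/self/limits
```

If the daemon's "Max processes" does not match the 65536 configured above, it was started before the limit took effect or outside that script.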
PS: our machines have 72 cores and 250 GB of RAM. Could you help me understand what causes native thread creation to fail?