Created 03-23-2018 06:49 AM
The YARN ResourceManager halts with "java.lang.OutOfMemoryError: unable to create new native thread", and the job fails over to the standby ResourceManager to complete the task.
How can I get this resolved?
Error message:
2018-03-22 02:30:09,637 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2018-03-22 02:30:10,413 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e189_1521451854044_2288_01_000002 Container Transitioned from ACQUIRED to RUNNING
2018-03-22 02:30:10,695 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: checking for deactivate...
2018-03-22 02:30:19,354 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:30,212 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: hue is accessing unchecked http://server1:43045/ws/v1/mapreduce/jobs/job_1521451854044_2288 which is the app master GUI of application_1521451854044_2288 owned by edh_srv_prod
2018-03-22 02:30:34,090 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[2101925946@qtp-1878992188-14302,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1095)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at org.mortbay.jetty.security.SslSocketConnector$SslConnection.run(SslSocketConnector.java:723)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
2018-03-22 02:30:34,093 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
yarn application -status application_1521451854044_2288
Application Report :
Application-Id : application_1521451854044_2288
Application-Name : oozie:launcher:T=shell:W=OS_Changes_incremental_workflow:A=shell-b8b2:ID=0006766-180222181315002-oozie-oozi-W
Application-Type : MAPREDUCE
User : edh_srv_prod
Queue : root.edh_srv_prod
Start-Time : 1521710999557
Finish-Time : 1521711593154
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED
Tracking-URL : https://server1:19890/jobhistory/job/job_1521451854044_2288
RPC Port : 40930
AM Host : server3
Aggregate Resource Allocation : 1809548 MB-seconds, 1181 vcore-seconds
Log Aggregation Status : SUCCEEDED
Diagnostics : Attempt recovered after RM restart
Created 03-23-2018 03:22 PM
It's likely that the host has run out of PIDs, which is why the RM can't create a new thread: on Linux every thread consumes a PID, so thread creation fails once all PIDs up to kernel.pid_max are in use. The following commands can help you confirm whether this is the issue and raise the maximum number of PIDs allowed.
Check the number of threads running:
ps -elfT | wc -l
Check the current pid_max:
sysctl kernel.pid_max
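Increase pid_max (this takes effect immediately but is lost on reboot unless the setting is also added to /etc/sysctl.conf):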
sysctl -w kernel.pid_max=4194304
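To put the two numbers side by side, the checks above can be combined into one small script. This is only a rough sketch: the function names (count_threads, usage_percent) and the 80% warning threshold are my own choices for illustration, and it assumes a Linux host with procps installed.

```shell
#!/bin/sh
# Sketch: compare the number of running threads with kernel.pid_max.
# Function names and the 80% threshold are illustrative assumptions.

# Count threads system-wide; on Linux each thread occupies one PID slot.
count_threads() {
    # -e: all processes, -L: one line per thread, -o tid=: thread id only, no header
    ps -eLo tid= | wc -l
}

# Integer percentage of the PID space in use: usage_percent <threads> <pid_max>
usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Only report when the kernel exposes pid_max (i.e. on Linux).
if [ -r /proc/sys/kernel/pid_max ]; then
    threads=$(count_threads)
    pid_max=$(cat /proc/sys/kernel/pid_max)
    pct=$(usage_percent "$threads" "$pid_max")
    echo "threads=$threads pid_max=$pid_max usage=${pct}%"

    # Arbitrary threshold chosen for this sketch.
    if [ "$pct" -ge 80 ]; then
        echo "WARNING: thread count is approaching kernel.pid_max"
    fi
fi
```

Running this periodically (for example from cron) around the time the RM usually halts should show whether the thread count really climbs toward pid_max before the crash.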
Created 03-23-2018 03:51 PM
Another check that might be informative is the last PID assigned:
sysctl kernel.ns_last_pid
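As a rough sketch, the last assigned PID can be read next to pid_max to see how close the allocator is to wrapping around. The helper name remaining_before_wrap is made up for this example, and kernel.ns_last_pid may not be present on every kernel, so the script checks for it first. A value close to pid_max does not prove exhaustion by itself, because the kernel wraps around and reuses free PIDs; treat it as a hint to look at the thread count.

```shell
#!/bin/sh
# Sketch: show how far the last assigned PID is from kernel.pid_max.
# The helper name is an illustrative assumption.

# remaining_before_wrap <last_pid> <pid_max>: PIDs left before the allocator wraps
remaining_before_wrap() {
    echo $(( $2 - $1 ))
}

# Both files are Linux-specific; skip quietly if either is missing.
if [ -r /proc/sys/kernel/ns_last_pid ] && [ -r /proc/sys/kernel/pid_max ]; then
    last_pid=$(cat /proc/sys/kernel/ns_last_pid)
    pid_max=$(cat /proc/sys/kernel/pid_max)
    left=$(remaining_before_wrap "$last_pid" "$pid_max")
    echo "last_pid=$last_pid pid_max=$pid_max remaining_before_wrap=$left"
fi
```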
Created 03-23-2018 03:57 PM
Thank you. I shall make the required changes and keep a watch on it.
Created 03-26-2018 07:45 PM
Please accept the answer if it fixes your problem.