Created 05-13-2020 02:02 AM
Running a Hortonworks Hadoop cluster (HDP-3.1.0.0) and getting a bunch of
Failed on local exception: java.io.IOException: Too many open files
errors when running Spark jobs that, up until this point, have worked fine.
I have seen many other questions like this where the answer is to increase the ulimit settings for open files and processes (this is also recommended in the HDP docs), and I'll note that I believe mine are still at the system default settings, but...
My question is: why is this only happening now, when the Spark jobs have previously been running fine for months?
The Spark jobs I have been running have worked for months without incident and I have made no recent code changes. I don't know enough about the internals of Spark to theorize about why things could be going wrong only now (it would seem odd to me if open files simply build up over the course of running Spark, but that appears to be what is happening).
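If it helps, this is roughly how I was planning to test the "open files build up over time" theory — a rough sketch only, and I'm assuming the containers run as the yarn user on the workers (not sure that's the right account):
# Sketch: sample the number of open file descriptors held by a user's
# processes every minute, so a slow leak shows up as a steadily rising count.
# "yarn" is an assumption -- adjust to whichever account runs the containers.
USER_TO_WATCH="yarn"
while true; do
    total=0
    for pid in $(pgrep -u "$USER_TO_WATCH"); do
        # /proc/<pid>/fd holds one entry per open descriptor
        n=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
        total=$((total + n))
    done
    echo "$(date '+%F %T') open_fds=$total"
    sleep 60
done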
Just as an example, this code...
. . .
sparkSession = SparkSession.builder.appName("GET_TABLE_COUNT").getOrCreate()
sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size()
. . .
now generates errors like...
. . .
[2020-05-12 19:04:45,810] {bash_operator.py:128} INFO - 20/05/12 19:04:45 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:46,813] {bash_operator.py:128} INFO - 20/05/12 19:04:46 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:47,816] {bash_operator.py:128} INFO - 20/05/12 19:04:47 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:48,818] {bash_operator.py:128} INFO - 20/05/12 19:04:48 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:49,820] {bash_operator.py:128} INFO - 20/05/12 19:04:49 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:50,822] {bash_operator.py:128} INFO - 20/05/12 19:04:50 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:51,828] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client: Application report for application_1579648183118_19918 (state: FAILED)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client:
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - client token: N/A
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - diagnostics: Application application_1579648183118_19918 failed 2 times due to Error launching appattempt_1579648183118_19918_000002. Got exception: java.io.IOException: DestHost:destPort hw005.co.local:45454 , LocalHost:localPort hw001.co.local/172.18.4.46:0. Failed on local exception: java.io.IOException: Too many open files
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.GeneratedConstructorAccessor808.newInstance(Unknown Source)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
. . .
My RAM and ulimit settings on the cluster look like...
[root@HW001]# clush -ab free -h
---------------HW001---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        9.0G        1.1G        1.7G         21G         19G
Swap:          8.5G         44K        8.5G
---------------HW002---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        7.3G        5.6G        568M         18G         22G
Swap:          8.5G        308K        8.5G
---------------HW003---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        6.1G        4.0G        120M         21G         24G
Swap:          8.5G        200K        8.5G
---------------HW004---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        2.8G        120M         25G         27G
Swap:          8.5G         28K        8.5G
---------------HW005---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        4.6G        120M         23G         27G
Swap:          8.5G         20K        8.5G
---------------airflowetl---------------
              total        used        free      shared  buff/cache   available
Mem:            46G        5.3G         13G        2.4G         28G         38G
Swap:          8.5G        124K        8.5G
[root@HW001]# clush -ab ulimit -a
---------------HW[001-005] (5)---------------
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127886
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127886
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
---------------airflowetl---------------
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 192394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 192394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
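(I realize the ulimit -a above is from root's shell and may not reflect what the running Hadoop daemons were actually started with, so I also tried something like the following rough check on a live process; the NodeManager pattern is just a guess on my part:)
# Rough check: the "Max open files" limit a running process actually has.
# Assumes a YARN NodeManager process is running on the host -- adjust the pattern.
pid=$(pgrep -f NodeManager | head -n 1)
grep "Max open files" /proc/"$pid"/limits
# ...and how many descriptors it currently holds:
ls /proc/"$pid"/fd | wc -l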
I don't know much about Hadoop administration, but just looking at the Ambari dashboard, the cluster does not seem to be overly taxed...
(though I could not actually check the RM web UI, since it just throws a "too many open files" error).
Does anyone with more Spark/Hadoop experience know why this would be happening now?
Created 05-13-2020 04:56 AM
Your problem may be caused by several reasons:
Firstly, I think 1024 is not enough; you should increase it (see the sketch after this list).
The number of opened files may be increasing day after day (an application may stream more data from/into split files).
A Spark application may also import/open more libraries today than it used to.
etc...
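For example, something like this raises the limit persistently; the user names and the 65536 value are only illustrative, so check the HDP docs for the values recommended for your release:
# Illustrative only -- raise the nofile limit for the service accounts in
# /etc/security/limits.conf (applied by PAM at the next login/launch).
cat >> /etc/security/limits.conf <<'EOF'
yarn   -  nofile  65536
hdfs   -  nofile  65536
spark  -  nofile  65536
EOF
# Restart the affected services (e.g. via Ambari), then verify from a fresh session:
su -s /bin/bash -c 'ulimit -n' yarn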
Please check the files opened by the user that runs the Spark jobs to find the possible cause:
lsof -u myUser (pipe to | wc -l to get a count)
Also check lsof +D <directory>, and find how many files are opened per job, how many jobs are running, etc.
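For example, a rough way to break that count down per process, which makes a single leaking job easier to spot (myUser is a placeholder for the account that runs the Spark jobs):
# Rough sketch: open files per process for the user, highest first.
# Column 2 of lsof output is the PID.
lsof -u myUser 2>/dev/null \
  | awk 'NR > 1 {count[$2]++} END {for (p in count) print count[p], p}' \
  | sort -rn \
  | head -20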
Created 05-14-2020 03:53 AM
Hello @rvillanueva ,
You can check how many threads a user is running with ps -L -u <username> | wc -l.
If the user's limits are hit (open files, ulimit -n, or max user processes, ulimit -u), the user can't open any more files or spawn any more threads, which would produce errors like the one above.
Kindly check the application log (application_XXX), if available, to see in which phase it throws the exception and on which node the issue occurs.
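For example, something along these lines compares current usage against the configured limits (a rough sketch; replace myUser with the account that actually runs the jobs):
# Rough sketch: current thread and open-file usage vs. the user's limits.
TARGET_USER=myUser
echo "threads in use  : $(ps -L -u "$TARGET_USER" --no-headers | wc -l)"
echo "process limit   : $(su -s /bin/bash -c 'ulimit -u' "$TARGET_USER")"
echo "open files      : $(lsof -u "$TARGET_USER" 2>/dev/null | wc -l)"
echo "open files limit: $(su -s /bin/bash -c 'ulimit -n' "$TARGET_USER")"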
Madhuri Adipudi, Technical Solutions Manager