
Previously working spark jobs only now throwing "java.io.IOException: Too many open files" error?

Expert Contributor

Running a Hortonworks Hadoop cluster (HDP-3.1.0.0) and getting a bunch of

Failed on local exception: java.io.IOException: Too many open files

errors when running spark jobs that up until this point have worked fine.

I have seen many other questions like this where the answer is to increase the ulimit settings for open files and processes (this is also in the HDP docs), and I'll note that I believe mine are still at the system defaults, but...
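For reference, the change those answers (and the docs) describe is typically something like the following on each node; the exact values and account names here are illustrative, not what the HDP docs actually recommend:

# /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/)
# raise the open-files limit for the accounts that run the Hadoop/Spark services
yarn    soft    nofile    32768
yarn    hard    nofile    65536
spark   soft    nofile    32768
spark   hard    nofile    65536

# verify from a fresh session for that account (assumes it has a login shell)
su - yarn -c 'ulimit -n'

(Services started outside a login session, e.g. by systemd, may need their limits raised separately rather than via limits.conf.)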

 

My question is: Why is this only happening now when previously the spark jobs have been running fine for months?

 

The Spark jobs I have been running have worked fine for months without incident and I have made no recent code changes. I don't know enough about the internals of Spark to theorize about why things would be going wrong only now (it would seem odd to me if open file handles just built up over the course of running Spark, but that seems to be what is happening).

Just as an example, this code...

...
sparkSession = SparkSession.builder.appName("GET_TABLE_COUNT").getOrCreate()
sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size()
...

now generates errors like...

...
[2020-05-12 19:04:45,810] {bash_operator.py:128} INFO - 20/05/12 19:04:45 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:46,813] {bash_operator.py:128} INFO - 20/05/12 19:04:46 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:47,816] {bash_operator.py:128} INFO - 20/05/12 19:04:47 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:48,818] {bash_operator.py:128} INFO - 20/05/12 19:04:48 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:49,820] {bash_operator.py:128} INFO - 20/05/12 19:04:49 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:50,822] {bash_operator.py:128} INFO - 20/05/12 19:04:50 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:51,828] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client: Application report for application_1579648183118_19918 (state: FAILED)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client:
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO -      client token: N/A
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO -      diagnostics: Application application_1579648183118_19918 failed 2 times due to Error launching appattempt_1579648183118_19918_000002. Got exception: java.io.IOException: DestHost:destPort hw005.co.local:45454 , LocalHost:localPort hw001.co.local/172.18.4.46:0. Failed on local exception: java.io.IOException: Too many open files
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.GeneratedConstructorAccessor808.newInstance(Unknown Source)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

My RAM and ulimit settings on the cluster look like...

[root@HW001]# clush -ab free -h
---------------HW001---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        9.0G        1.1G        1.7G         21G         19G
Swap:          8.5G         44K        8.5G
---------------HW002---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        7.3G        5.6G        568M         18G         22G
Swap:          8.5G        308K        8.5G
---------------HW003---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        6.1G        4.0G        120M         21G         24G
Swap:          8.5G        200K        8.5G
---------------HW004---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        2.8G        120M         25G         27G
Swap:          8.5G         28K        8.5G
---------------HW005---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        4.6G        120M         23G         27G
Swap:          8.5G         20K        8.5G
---------------airflowetl---------------
              total        used        free      shared  buff/cache   available
Mem:            46G        5.3G         13G        2.4G         28G         38G
Swap:          8.5G        124K        8.5G
[root@HW001]#
[root@HW001]# clush -ab ulimit -a
---------------HW[001-005] (5)---------------
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127886
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 127886
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
---------------airflowetl---------------
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 192394
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 192394
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I don't know much about Hadoop admin, but just looking at the Ambari dashboard, the cluster does not seem to be overly taxed...

(screenshot: Capture001.PNG)

(though I could not actually check the RM web UI, since it just throws a "too many open files" error).

Anyone with more spark/hadoop experience know why this would be happening now?

2 REPLIES

Contributor

Your problem may be caused by several things.

 

First, I think 1024 is not enough; you should increase it.

 

The number of open files may be increasing day after day (an application may stream more data from/into split files).

A Spark application may also import/open more libraries today than it used to.

etc...

 

Please check the files opened by the user that runs the Spark jobs to find the possible cause:

lsof -u myUser ( | wc -l ... )

 

Also check lsof per directory (lsof +D directory), and find how many open files each job holds, how many jobs are running, etc...
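For example, something along these lines (the yarn account here is just a guess at which user actually owns the containers; substitute whoever runs the jobs on your nodes):

# total open files for that user
lsof -u yarn | wc -l

# rough breakdown of open files per process (PID is column 2 of lsof output)
lsof -u yarn | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn | head

# system-wide view: allocated handles, free handles, kernel maximum
cat /proc/sys/fs/file-nr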

 

 

Moderator

Hello @rvillanueva ,

 

You can check how many threads are used by a user by running ps -L -u <username> | wc -l

 

If the user's open-files limit (ulimit -n for that user) is hit, then the user can't spawn any more threads. The most likely reasons in this case are:

  1. The same user is running other jobs and holding open files on the node where the container is being launched/spawned.
  2. System threads might have been excluded from the count.
  3. Check which applications are running and what their current open-file counts are.

Kindly check the application log (application_XXX), if available, to see in which phase the exception is thrown and on which node the issue occurs.
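For example, on the node named in the diagnostics above (hw005), checks along these lines could help narrow it down; the yarn account and the NodeManager process-name match are assumptions about this particular setup:

# threads currently used by the account that launches the containers
ps -L -u yarn | wc -l

# open files held by the NodeManager process (assumes its JVM command line contains "NodeManager")
lsof -p "$(pgrep -f NodeManager | head -1)" | wc -l

# compare against the limit that process is actually running with
grep 'open files' /proc/"$(pgrep -f NodeManager | head -1)"/limits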

 


Madhuri Adipudi, Technical Solutions Manager

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
