
Pyspark machine learning gets too many open file error


Hi,

I run PySpark on my Hadoop cluster using spark-submit:

spark-submit --master yarn-client --driver-memory 4g --executor-memory 6g --total-executor-cores 10 --num-executors 5 --conf spark.yarn.queue=alpha --conf spark.executor.instances=5 usr_recommendation.py

I got this error:

java.io.FileNotFoundException: /hadoop/yarn/local/usercache/hdfs/appcache/application_1450771823865_0008/blockmgr-16947187-1ea7-4e42-a652-52559363c4d7/1f/temp_shuffle_9f5ed80b-eabc-4eb5-892a-ce5a7c0c0d0e (Too many open files)

and also this error:

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1784711150-172.16.200.242-1447830226283:blk_1074661844_921035 file=/data/catalogs/visitor/1450770445/part-04752

Is this a configuration issue?

These are my settings for HDFS:

hdfs_user_nproc_limit = 100000
hdfs_user_nofile_limit = 1000000

and these are my settings for YARN:

yarn_user_nofile_limit = 100000
yarn_user_nproc_limit = 1000000

6 Replies

Re: Pyspark machine learning gets too many open file error

@cokorda putra susila

There are two issues here: max open files and a missing block. It looks like the "too many open files" error led to the next one.

Your settings look OK, but you may want to check this: http://docs.hortonworks.com/HDPDocuments/Ambari-2....

hadoop dfsadmin -report

hadoop fsck /data/catalogs/visitor -files -blocks
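
If it helps, here is a small PySpark sketch (assuming a live SparkContext named sc) to see the effective open-file limit the driver and executors are actually running with:

import resource

# nofile limit seen by the driver process, as (soft, hard)
print(resource.getrlimit(resource.RLIMIT_NOFILE))

# nofile limit seen by the executor JVM's Python workers
def nofile_limit(_):
    import resource
    yield resource.getrlimit(resource.RLIMIT_NOFILE)

print(sc.parallelize(range(10), 10).mapPartitions(nofile_limit).collect())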


Re: Pyspark machine learning gets too many open file error

Hi @Neeraj Sabharwal, thanks for the response.

I tried:

hadoop dfsadmin -report => Decommission Status : Normal

hadoop fsck /data/catalogs/visitor -files -blocks => filesystem under path '/data/catalogs/visitor' is HEALTHY

I also increased the open-file ulimit to 1 million.

I think this error happens because PySpark is reading too many small files from my HDFS.

Do you know the best practice for merging small files in HDFS, like in the picture below, into one file,

so that my PySpark job does not open too many files while running the modeling?

Thanks

(Attached screenshot: 1001-screen-shot-2015-12-22-at-91257-pm.png)


Re: Pyspark machine learning gets too many open file error

New Contributor

In the worst case, I combine all the part files and write them into a single file using:

hadoop fs -cat output/part-* > localoutput/Result.txt
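
If you want the merged data to stay in HDFS instead of the local filesystem, a minimal PySpark sketch (the paths here are only placeholders for your own directories) is to re-read the small files and write them back out with coalesce:

# Read the many small part files and rewrite them as one larger file in HDFS.
# Input and output paths are placeholders; adjust them to your own layout.
rdd = sc.textFile("/data/catalogs/visitor/1450770445")
rdd.coalesce(1).saveAsTextFile("/data/catalogs/visitor/1450770445_merged")

coalesce(1) produces a single output file; use a larger number if one file per dataset would be too big for a single task.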


Re: Pyspark machine learning gets too many open file error

Expert Contributor

Spark creates a bunch of intermediate files prior to a shuffle. If you're using many cores on your executors, a high level of parallelism, and many unique keys, you may run into this issue.

The first thing to try is to consolidate the intermediate files into fewer files. Pass this option to your spark-submit:

--conf spark.shuffle.consolidateFiles=true

This setting is disabled by default. Here is more information directly from the Apache documentation:

"If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations."

Re: Pyspark machine learning gets too many open file error

New Contributor

I have the same issue. I tried setting this parameter in spark-defaults.conf and also passing it along with spark-submit (spark-submit file.py --conf spark.shuffle.consolidateFiles=true). Still getting the same error.


Re: Pyspark machine learning gets too many open file error

Mentor

@cokorda putra susila can you accept the best answer to close this thread?
