I run PySpark on my Hadoop cluster using spark-submit:
spark-submit --master yarn-client --driver-memory 4g --executor-memory 6g --total-executor-cores 10 --num-executors 5 --conf spark.yarn.queue=alpha --conf spark.executor.instances=5 usr_recommendation.py
I got this error:
java.io.FileNotFoundException: /hadoop/yarn/local/usercache/hdfs/appcache/application_1450771823865_0008/blockmgr-16947187-1ea7-4e42-a652-52559363c4d7/1f/temp_shuffle_9f5ed80b-eabc-4eb5-892a-ce5a7c0c0d0e (Too many open files)
and also this error:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1784711150-172.16.200.242-1447830226283:blk_1074661844_921035 file=/data/catalogs/visitor/1450770445/part-04752
Is this a configuration issue?
These are my settings on HDFS:
hdfs_user_nproc_limit = 100000
hdfs_user_nofile_limit = 1000000
and these are my settings on YARN:
yarn_user_nofile_limit = 100000
yarn_user_nproc_limit = 1000000
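If you want to confirm the limit that the executor processes actually inherit (rather than the configured values), here is a minimal PySpark sketch using Python's standard resource module; the app name and partition count are placeholders:

from pyspark import SparkContext
import resource

def nofile_limit(_):
    # RLIMIT_NOFILE is the per-process cap on open file descriptors;
    # getrlimit returns the (soft, hard) pair this process inherited.
    return resource.getrlimit(resource.RLIMIT_NOFILE)

sc = SparkContext(appName="check_nofile")
# Run the check inside a few executor tasks rather than only on the driver.
print(sc.parallelize(range(4), 4).map(nofile_limit).collect())
sc.stop()

If the soft limit printed here is much lower than 1000000, your ulimit change is not reaching the YARN containers.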
There are two issues here: too many open files and a missing block. It looks like the "too many open files" error led to the second one.
Your settings look OK, but you may want to check this: http://docs.hortonworks.com/HDPDocuments/Ambari-2....
Also run these to check the cluster state and the blocks under the affected path:
hadoop dfsadmin -report
hadoop fsck /data/catalogs/visitor -files -blocks
Hi @Neeraj Sabharwal, thanks for the response.
hadoop dfsadmin -report => Decommission Status : Normal
hadoop fsck /data/catalogs/visitor -files -blocks => The filesystem under path '/data/catalogs/visitor' is HEALTHY
I also increased the ulimit for open files to 1 million.
I think this error happens because too many small files in my HDFS are being read by PySpark.
Do you know the best practice for merging small files (like those in the picture below) in HDFS into one file,
so that my PySpark job doesn't open too many files while running the modeling?
In the worst case I combine all the part files and write them into a single file using:
hadoop fs -cat output/part-* > localoutput/Result.txt
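A more scalable alternative is to compact the small part files inside HDFS with PySpark's coalesce, so the data never has to pass through the local filesystem. A minimal sketch (the input path and target partition count are assumptions; tune the count so each output file lands near the HDFS block size):

from pyspark import SparkContext

sc = SparkContext(appName="compact_small_files")

# Read every small part file under the directory as one RDD.
rdd = sc.textFile("/data/catalogs/visitor/1450770445")

# coalesce() lowers the partition count without a full shuffle, so
# saveAsTextFile writes that many larger part files instead of
# thousands of tiny ones.
rdd.coalesce(10).saveAsTextFile("/data/catalogs/visitor/1450770445_compacted")

sc.stop()

For the cat-to-local approach, note that hadoop fs -getmerge output localoutput/Result.txt does the same thing in one command.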
Spark creates a bunch of intermediate files prior to a shuffle. If you're using many cores per executor, a high level of parallelism, and many unique keys, you may run into this issue: with the hash-based shuffle, each map task writes one file per reduce partition, so for example 10 concurrently running tasks and 2,000 reduce partitions can mean 20,000 file handles open at once on a single node.
The first thing to try is to consolidate the intermediate files into fewer files. Pass this on your spark-submit to consolidate them:
--conf spark.shuffle.consolidateFiles=true
This option defaults to false. Here is more information, directly from the Apache page:
"If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations."
I have the same issue. I tried setting this parameter in spark-defaults.conf as well as passing it along with spark-submit (spark-submit file.py --conf spark.shuffle.consolidateFiles=true). I'm still getting the same error.