Created 12-22-2015 10:45 AM
Hi,
I run PySpark on my Hadoop cluster using spark-submit:
spark-submit --master yarn-client --driver-memory 4g --executor-memory 6g --total-executor-cores 10 --num-executors 5 --conf spark.yarn.queue=alpha --conf spark.executor.instances=5 usr_recommendation.py
I got this error:
java.io.FileNotFoundException: /hadoop/yarn/local/usercache/hdfs/appcache/application_1450771823865_0008/blockmgr-16947187-1ea7-4e42-a652-52559363c4d7/1f/temp_shuffle_9f5ed80b-eabc-4eb5-892a-ce5a7c0c0d0e (Too many open files)
and also this error:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1784711150-172.16.200.242-1447830226283:blk_1074661844_921035 file=/data/catalogs/visitor/1450770445/part-04752
Is this a configuration issue?
These are my settings for HDFS:
hdfs_user_nproc_limit = 100000
hdfs_user_nofile_limit = 1000000
and these are my settings for YARN:
yarn_user_nofile_limit = 100000
yarn_user_nproc_limit = 1000000
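A side note, not from the original post: yarn_user_nofile_limit above is 100000, an order of magnitude lower than the HDFS value, and since the Spark executors run inside YARN containers, that may be the limit that actually applies. The limit a running process inherits can be confirmed on a worker node (with <nodemanager_pid> as a placeholder for the NodeManager's process ID):
cat /proc/<nodemanager_pid>/limits | grep 'open files'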
Created 12-22-2015 01:27 PM
There are two issues here: max open files and a missing block. It looks like the "too many open files" error led to the second one.
Your settings look OK, but you may want to check this: http://docs.hortonworks.com/HDPDocuments/Ambari-2....
hadoop dfsadmin -report
hadoop fsck /data/catalogs/visitor -files -blocks
Created on 12-22-2015 02:15 PM - edited 08-19-2019 05:27 AM
Hi @Neeraj Sabharwal, thanks for the response.
I tried:
hadoop dfsadmin -report => Decommission Status: Normal
hadoop fsck /data/catalogs/visitor -files -blocks => The filesystem under path '/data/catalogs/visitor' is HEALTHY
I also increased the open-file ulimit to 1 million.
I think this error happens because there are too many small files in my HDFS being read by the PySpark job.
Do you know the best practice for merging small files (like in the picture below) in HDFS into one file, so my PySpark job doesn't open too many files while running the modeling?
Thanks
Created 12-29-2015 01:14 PM
In the worst case, I combine all the part files and write them into a single local file using:
hadoop fs -cat output/part-* > localoutput/Result.txt
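Two alternatives worth noting, neither from the original reply: hadoop fs -getmerge output localoutput/Result.txt does the same copy-and-concatenate in one step. And to avoid producing so many small part files in the first place, the job that writes them can coalesce its partitions before saving. A minimal PySpark sketch, where the input path is taken from the error above and the output path and partition count of 16 are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="merge-small-files")
# Read the directory full of small part files...
data = sc.textFile("/data/catalogs/visitor/1450770445")
# ...and rewrite it as a few larger files. coalesce() only reduces
# the partition count, so it does not trigger another shuffle.
data.coalesce(16).saveAsTextFile("/data/catalogs/visitor/1450770445_merged")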
Created 12-28-2015 10:09 PM
Spark creates a bunch of intermediate files prior to a shuffle. If you're using many cores on your executors, a high level of parallelism, and many unique keys, you may run into this issue.
The first thing to try is consolidating the intermediate files into fewer files. Pass this on your spark-submit:
--conf spark.shuffle.consolidateFiles=true
This setting is false by default. Here is more information, directly from the Apache documentation:
"If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations."
Created 03-30-2016 09:58 AM
I have the same issue. I tried setting this parameter in spark-defaults.conf as well as passing it along with spark-submit (spark-submit file.py --conf spark.shuffle.consolidateFiles=true). I'm still getting the same error.
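One thing to double-check, as this is easy to miss: spark-submit treats everything after the application file as arguments to the script itself, so a --conf placed after file.py never reaches Spark. The flag has to come before the script:
spark-submit --conf spark.shuffle.consolidateFiles=true file.py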
Created 02-02-2016 02:06 AM
@cokorda putra susila can you accept the best answer to close this thread?