Created 05-27-2016 10:13 PM
I've got a weird/wild one for sure and am wondering if anyone has any insight. Heck, I'm giving out "BONUS POINTS" for this one. I'm dabbling with sc.textFile()'s optional minPartitions parameter to make my Hadoop file have more RDD partitions than the number of HDFS blocks.
When testing with a single-block HDFS file, everything works fine up to 8 partitions, but from 9 onward I seem to get one extra partition, as shown below.
>>> rdd1 = sc.textFile("statePopulations.csv",8)
>>> rdd1.getNumPartitions()
8
>>> rdd1 = sc.textFile("statePopulations.csv",9)
>>> rdd1.getNumPartitions()
10
>>> rdd1 = sc.textFile("statePopulations.csv",10)
>>> rdd1.getNumPartitions()
11
I was wondering if there was some magical implementation activity happening at 9 partitions (or 9x the number of blocks), but I didn't see a similar behavior on a 5-block file I have.
>>> rdd2 = sc.textFile("/proto/2000.csv")
>>> rdd2.getNumPartitions()
5
>>> rdd2 = sc.textFile("/proto/2000.csv",9)
>>> rdd2.getNumPartitions()
9
>>> rdd2 = sc.textFile("/proto/2000.csv",45)
>>> rdd2.getNumPartitions()
45
Really not a pressing concern, but it sure has made me ask WTH (What The Hadoop?). Anyone know what's going on?
Created 05-29-2016 05:20 PM
Take a look at this blog post, which describes the internal workings of textFile: http://www.bigsynapse.com/spark-input-output
This PR discussion gives you the rationale for why the default values are what they are: https://github.com/mesos/spark/pull/718
Hope this helps
Created 05-29-2016 04:50 PM
Under the covers, Spark uses a Hadoop TextInputFormat to read the file. The minPartitions number is passed as an input to the FileInputFormat getSplits method.
This function is pretty complex and uses goalSize, blockSize, and minSize to divide the file into splits, where goalSize is totalSize / numSplits.
Looking at it, it should normally honour your request, but with a very small file you might be running into some rounding issues. You could try running the code with your block size to see if that is the case.
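To make the rounding issue concrete, here is a rough Python sketch of the sizing logic in the old mapred FileInputFormat.getSplits (not the actual Hadoop code), using a made-up 619-byte file length just for illustration. With a small single-block file, the leftover from the integer division totalSize / numSplits can exceed the 10% "slop" allowance and spill into one extra split:

# Rough sketch of FileInputFormat.getSplits sizing for a single file.
# The 619-byte length and 128 MB block size are hypothetical stand-ins.
SPLIT_SLOP = 1.1  # a split may run up to 10% over splitSize before a new one is cut

def count_splits(file_length, num_splits_hint, block_size=128 * 1024 * 1024, min_size=1):
    goal_size = file_length // max(num_splits_hint, 1)      # goalSize = totalSize / numSplits
    split_size = max(min_size, min(goal_size, block_size))  # computeSplitSize()
    splits, bytes_remaining = 0, file_length
    while bytes_remaining / split_size > SPLIT_SLOP:         # cut full-size splits
        splits += 1
        bytes_remaining -= split_size
    if bytes_remaining != 0:                                 # the leftover becomes one more split
        splits += 1
    return splits

print(count_splits(619, 8))   # -> 8
print(count_splits(619, 9))   # -> 10 (leftover 7 bytes > 10% of the 68-byte splitSize)
print(count_splits(619, 10))  # -> 11

With a larger file, the integer-division leftover is tiny relative to the split size and gets absorbed by the slop, which would explain why your 5-block file gives exactly the number of partitions you asked for.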
It should not matter though, since Hadoop will make sure that each record is processed exactly once (by ignoring the first unfinished record of any split and overreading past the split boundary to finish the last record).
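For what it's worth, here's a toy Python illustration of that convention (not Hadoop's actual LineRecordReader): every reader except the one starting at offset 0 skips the partial line at the front of its range, and every reader reads past the end of its range to finish its last line, so each line is owned by exactly one split.

# Toy sketch of the line-reader convention: records are never split or duplicated,
# even when the byte ranges cut lines in half. Illustration only, not Hadoop code.

def read_split(data: bytes, start: int, end: int):
    """Return the lines 'owned' by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Skip the (possibly partial) line that the previous split will finish.
        nl = data.find(b"\n", start - 1)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    while pos < end:                          # over-read past 'end' to finish the last line
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        records.append(data[pos:nxt].rstrip(b"\n"))
        pos = nxt
    return records

data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = [(0, 9), (9, 18), (18, len(data))]   # arbitrary ranges that cut lines in half
for s, e in splits:
    print((s, e), read_split(data, s, e))

The byte ranges deliberately fall mid-line, yet every line comes out exactly once across the three splits.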