Increasing textFile() partitioning number anomaly
Labels: Apache Spark
Created 05-27-2016 10:13 PM
I've got a weird/wild one for sure and am wondering if anyone has any insight. Heck, I'm giving out "BONUS POINTS" for this one. I'm dabbling with sc.textFile()'s optional minPartitions parameter to make my Hadoop file have more RDD partitions than the number of HDFS blocks.
When testing with a single-block HDFS file, everything works fine up through 8 partitions, but from 9 onward it seems to create one extra partition, as shown below.
>>> rdd1 = sc.textFile("statePopulations.csv",8)
>>> rdd1.getNumPartitions()
8
>>> rdd1 = sc.textFile("statePopulations.csv",9)
>>> rdd1.getNumPartitions()
10
>>> rdd1 = sc.textFile("statePopulations.csv",10)
>>> rdd1.getNumPartitions()
11
I was wondering if there was some magical implementation behavior kicking in at 9 partitions (or 9x the number of blocks), but I didn't see similar behavior on a 5-block file I have.
>>> rdd2 = sc.textFile("/proto/2000.csv")
>>> rdd2.getNumPartitions()
5
>>> rdd2 = sc.textFile("/proto/2000.csv",9)
>>> rdd2.getNumPartitions()
9
>>> rdd2 = sc.textFile("/proto/2000.csv",45)
>>> rdd2.getNumPartitions()
45
Really not a pressing concern, but it sure has made me ask WTH? (What The Hadoop?) Anyone know what's going on?
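In case it helps, here's a quick way to see how the records actually land across partitions (a sketch, not pasted from my session; glom() groups each partition's records into a list):

# Per-partition record counts for the single-block file
rdd1 = sc.textFile("statePopulations.csv", 9)
print(rdd1.getNumPartitions())           # 10, not the 9 requested
print(rdd1.glom().map(len).collect())    # record count in each partition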
Created 05-29-2016 05:20 PM
Take a look at this blog post, which describes the internal workings of textFile: http://www.bigsynapse.com/spark-input-output
This PR discussion gives you the rationale for why the default values are what they are: https://github.com/mesos/spark/pull/718
Hope this helps.
Created 05-29-2016 04:50 PM
Under the covers, Spark uses a Hadoop TextInputFormat to read the file. The minPartitions number is passed as the numSplits input to FileInputFormat's getSplits method.
That method is fairly complex and uses a goalSize, blockSize, and minSize to break the file into splits, with goalSize being totalSize / numSplits.
Looking at it, it should normally honor your request, but you might be running into a scenario where you have a very small file and hit some rounding issues. You could try running the code on a file of roughly your block size to see if that is the case.
It shouldn't matter for correctness, though, since Hadoop makes sure each record is processed exactly once (by ignoring the first unfinished record of any split and over-reading past the split boundary to finish the last record).
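To make the rounding concrete, here is a simplified sketch (in Python, not the actual Hadoop source) of the getSplits arithmetic in the old-API FileInputFormat. The max/min formula and the 1.1 "slop" factor mirror Hadoop's implementation; the 638-byte file size is a made-up number chosen purely for illustration, since the real size of statePopulations.csv isn't known:

SPLIT_SLOP = 1.1  # Hadoop allows a split to run up to 10% over splitSize

def num_splits(total_size, block_size, requested_splits, min_size=1):
    goal_size = total_size // max(requested_splits, 1)       # integer division truncates
    split_size = max(min_size, min(goal_size, block_size))   # computeSplitSize()
    splits, bytes_remaining = 0, total_size
    while bytes_remaining / split_size > SPLIT_SLOP:
        splits += 1
        bytes_remaining -= split_size
    if bytes_remaining != 0:
        splits += 1                                          # small tail split
    return splits

block = 128 * 1024 * 1024                # single-block file, so blockSize never caps splitSize
print(num_splits(638, block, 8))         # -> 8
print(num_splits(638, block, 9))         # -> 10
print(num_splits(638, block, 10))        # -> 11

With a small file, totalSize // numSplits truncates, and once that leftover remainder exceeds 10% of the split size the loop emits one extra full split plus a tiny tail split, which would explain the off-by-one you're seeing. On the 5-block file the splits are far larger, so the remainder stays well under the 10% slop and you get exactly what you asked for.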
