Support Questions

increasing textFile() partitioning number anomaly


I've got a weird/wild one for sure and am wondering if anyone has any insight. Heck, I'm giving out "BONUS POINTS" for this one. I'm dabbling with sc.textFile()'s optional minPartitions parameter to make my Hadoop file have more RDD partitions than the number of HDFS blocks.

When testing with a single-block HDFS file, everything works fine up to 8 partitions, but from 9 onward it seems to add one extra partition, as shown below.

>>> rdd1 = sc.textFile("statePopulations.csv",8)
>>> rdd1.getNumPartitions()
8
>>> rdd1 = sc.textFile("statePopulations.csv",9)
>>> rdd1.getNumPartitions()
10
>>> rdd1 = sc.textFile("statePopulations.csv",10)
>>> rdd1.getNumPartitions()
11 

I was wondering if there was some magical implementation activity happening at 9 partitions (or 9x the number of blocks), but I didn't see similar behavior on a 5-block file I have.

>>> rdd2 = sc.textFile("/proto/2000.csv")
>>> rdd2.getNumPartitions()
5
>>> rdd2 = sc.textFile("/proto/2000.csv",9)
>>> rdd2.getNumPartitions()
9
>>> rdd2 = sc.textFile("/proto/2000.csv",45)
>>> rdd2.getNumPartitions()
45 

Really not a pressing concern, but sure has made me ask WTH? (What The Hadoop?) Anyone know what's going on?

2 REPLIES

Master Guru

Under the covers, Spark uses a Hadoop TextInputFormat to read the file. The minPartitions number is passed as an input to FileInputFormat's getSplits method.

http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/...

This function is fairly complex and uses goalSize, blockSize, and minSize to break the file into splits, with goalSize being totalSize / numSplits.

Looking at it, it should normally honour your request, but you might be running into a scenario where you have a very small file and hit some rounding issues. You could try running the code with your block size to see if that is the case.
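
To make the rounding concrete, here is a rough Python sketch of that split loop (not the real Hadoop code, which is Java; simplified to a single uncompressed file, a minSize of 1, and no locality handling). The 248-byte file size is purely hypothetical, picked because it happens to reproduce the 8 / 10 / 11 pattern above; the actual size of statePopulations.csv isn't given in the question.

# Rough Python sketch of the split loop in the old-API
# org.apache.hadoop.mapred.FileInputFormat.getSplits, simplified to one
# uncompressed file, minSize of 1, and no rack/locality handling.
SPLIT_SLOP = 1.1   # Hadoop lets the last chunk be up to 10% oversized

def count_splits(total_size, requested_splits, block_size=128 * 1024 * 1024):
    goal_size = total_size // max(requested_splits, 1)   # integer division
    split_size = max(1, min(goal_size, block_size))      # computeSplitSize()

    splits = 0
    bytes_remaining = total_size
    while bytes_remaining / split_size > SPLIT_SLOP:
        splits += 1
        bytes_remaining -= split_size
    if bytes_remaining != 0:
        splits += 1   # the leftover tail becomes one extra split
    return splits

# Hypothetical 248-byte single-block file: prints 8 -> 8, 9 -> 10, 10 -> 11
for n in (8, 9, 10):
    print(n, count_splits(248, n))

The key is that goalSize comes from an integer division, so the requested number of splits doesn't quite cover the file; whenever the leftover tail is more than 10% of the split size (the SPLIT_SLOP factor), it gets its own split and you end up with one partition more than you asked for.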

It should not matter though, since Hadoop will make sure that each record is processed exactly once (by ignoring the first unfinished record of any block and over-reading the split to finish the last record).


Hi @Lester Martin

Take a look at this blog post, which describes the internal workings of textFile: http://www.bigsynapse.com/spark-input-output

This PR discussion gives you the rationale for why the default values are what they are: https://github.com/mesos/spark/pull/718

Hope this helps