Created 04-25-2016 05:01 PM
I am seeing some strange behavior. I have a file stored in Azure WASB (size 1 GB). When I create an RDD with the statement below, it only creates two partitions. I was under the impression that the partitioning should be based on the HDFS block size, which is 128 MB in our environment.
val fileRDD = sc.textFile("/user/aahmed/file.csv")
It seems to create one partition for roughly every 500 MB. I tried it with one large file (28 GB) and got 56 partitions.
It is supposed to be based on the HDFS block size, not on 500 MB.
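For reference, this is a minimal way to check the partition count (same fileRDD as above):

val fileRDD = sc.textFile("/user/aahmed/file.csv")
println(fileRDD.getNumPartitions)    // shows 2 for the 1 GB file
// equivalently, on older Spark versions:
println(fileRDD.partitions.length)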
Created 04-25-2016 05:18 PM
Very curious. The partitions are defined by minPartitions or by the default splits from your InputFormat; for a text file this should be TextInputFormat. Below is the relevant code from HadoopRDD.scala.
So I don't think Spark is doing anything wrong here. Can you try to read the same file in Pig and see how many tasks it creates in that case? It might be a WASB issue.
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}
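To spell out why the block size matters here: for a plain text file, getSplits goes through the old-API FileInputFormat, which picks a split size from the block size the file system reports. A rough sketch of that arithmetic with the numbers from your post (illustration only, not the exact Hadoop code):

// splitSize = max(minSize, min(goalSize, blockSize)) in org.apache.hadoop.mapred.FileInputFormat
val totalSize     = 1L * 1024 * 1024 * 1024            // the 1 GB file
val minPartitions = 2                                  // Spark's defaultMinPartitions is usually 2
val goalSize      = totalSize / minPartitions          // 512 MB
val minSize       = 1L                                 // default split min size
val blockSize     = 512L * 1024 * 1024                 // block size reported by the file system
val splitSize     = math.max(minSize, math.min(goalSize, blockSize))  // 512 MB here
val numSplits     = math.ceil(totalSize.toDouble / splitSize).toInt   // -> 2 partitions
// With a 128 MB block size the same formula gives 128 MB splits, i.e. 8 partitions.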
Created 04-25-2016 10:38 PM
Hi Benjamin,
I tested this locally and on the Hortonworks sandbox. In both places I get the expected behavior: the partitioning is based on the split size. I think it is something related to WASB.
Thanks
Created 04-26-2016 04:51 PM
Actually, the default "block size" for WASB is 512 MB. So that explains it.
"dfs.block.size which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB (set to 512 MB by default), max split size etc."