
Spark RDD partition behavior in HDInsight (Azure)

Explorer

I am seeing strange behavior. I have a file stored in Azure WASB (size 1 GB). When I create an RDD using the statement below, it only creates two partitions. I was under the impression that the number of partitions should be based on the HDFS block size, which is 128 MB in our environment.

val fileRDD = sc.textFile("/user/aahmed/file.csv")

It seems like it creates one partition for every 500 MB. I tried it with a larger file (28 GB) and got 56 partitions.

Isn't it supposed to be based on the HDFS block size rather than on 500 MB?
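
For reference, here is a minimal sketch of how the partition counts above were checked; the optional second argument to textFile is only a minimum-partitions hint, but it is shown in case that is the intended workaround:

val fileRDD = sc.textFile("/user/aahmed/file.csv")
println(fileRDD.partitions.length)   // prints 2 for the 1 GB file

// textFile also takes an optional minPartitions hint (a minimum, not an exact count)
val morePartitions = sc.textFile("/user/aahmed/file.csv", 8)
println(morePartitions.partitions.length)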


3 REPLIES

Master Guru

Very curious. The partitions are defined by minPartitions or the default splits computed by your InputFormat; for a text file this should be TextInputFormat. Below is the relevant code from HadoopRDD.scala.

So I don't think Spark is doing anything wrong here. Can you try to read the same file in Pig and see how many tasks it creates in that case? It might be a WASB issue.

 override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
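
For what it is worth, the split-size arithmetic in the old mapred FileInputFormat (which backs TextInputFormat here) boils down to roughly the following. This is a simplified sketch, not the exact source:

// Simplified sketch of the FileInputFormat.getSplits() arithmetic:
// goalSize = totalSize / numSplits, where numSplits is the minPartitions
// hint Spark passes in above, and blockSize is whatever block size the
// underlying file system reports for the file.
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// Your 1 GB file with Spark's default minPartitions of 2:
val totalSize = 1L * 1024 * 1024 * 1024
val goalSize  = totalSize / 2                       // 512 MB
// If the file system really reported a 128 MB block size, this would give 8 splits:
val splitSize = computeSplitSize(goalSize, 1L, 128L * 1024 * 1024)
println(totalSize / splitSize)                      // 8

Since you only see 2 partitions, the block size reported for your WASB file is apparently much larger than 128 MB.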

Explorer

Hi Benjamin,

I tested this locally and in the Hortonworks sandbox. In both places I get the expected behavior: the number of partitions is based on the split size. I think it is something related to WASB.

Thanks

Master Guru (Accepted solution)

@Adnan Ahmed

Actually, the default "block size" for WASB is 512 MB (roughly the 500 MB you are seeing), so that explains it.

http://i1.blogs.msdn.com/b/bigdatasupport/archive/2015/02/17/sqoop-job-performance-tuning-in-hdinsig...

"dfs.block.size which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB (set to 512 MB by default), max split size etc."