Spark RDD partition behavior in HDInsight (Azure)

Explorer

I am seeing strange behavior. I have a 1 GB file stored in Azure WASB. When I create an RDD using the statement below, it creates only two partitions. I was under the impression the partition count should be based on the HDFS block size, which is 128 MB in our environment.

val fileRDD = sc.textFile("/user/aahmed/file.csv")

It seems like it creates one partition for roughly every 500 MB. I tried it with a larger file (28 GB) and got 56 partitions.

It is supposed to be based on the HDFS block size, not on 500 MB.
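
For reference, a minimal sketch (using the same path as above) of checking the partition count and requesting more splits through the optional minPartitions argument of textFile, which is forwarded to InputFormat.getSplits:

// Check how many partitions the default call produces (2 in the scenario above).
val fileRDD = sc.textFile("/user/aahmed/file.csv")
println(fileRDD.partitions.length)

// Ask for more splits explicitly; for a splittable text file the result
// should be at least the requested minimum.
val fileRDD8 = sc.textFile("/user/aahmed/file.csv", minPartitions = 8)
println(fileRDD8.partitions.length)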


Master Guru

Very curious. The partitions are defined by minPartitions or by the default splits of your InputFormat, which for a text file should be TextInputFormat. Below is the relevant code from HadoopRDD.scala.

So I don't think Spark is doing anything wrong here. Can you try reading the same file in Pig and see how many tasks it creates? It might be a WASB issue.

 override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
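
For context, the old-API FileInputFormat that TextInputFormat inherits picks a split size of max(minSize, min(goalSize, blockSize)), where goalSize is totalSize / minPartitions and blockSize is whatever block size the file system reports for the file. A rough sketch of that arithmetic with the numbers from the question; the 512 MB block size is an assumption (I believe it is what WASB reports by default, configurable via fs.azure.block.size), not something taken from the thread:

// Plugging the question's numbers into the mapred FileInputFormat split math.
val totalSize = 1L * 1024 * 1024 * 1024                            // 1 GB file
val goalSize  = totalSize / 2                                      // minPartitions defaults to 2
val minSize   = 1L                                                 // split.minsize default
val blockSize = 512L * 1024 * 1024                                 // assumed WASB-reported block size

val splitSize = math.max(minSize, math.min(goalSize, blockSize))   // 512 MB
val numSplits = math.ceil(totalSize.toDouble / splitSize).toInt    // 2 splits for 1 GB
// The same 512 MB split size gives 28 GB / 512 MB = 56 splits, matching the 28 GB test.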

Explorer

Hi Benjamin,

I tested this both locally and in the Hortonworks sandbox. In both places I get the expected behavior: the partition count follows the split size. I think it is something related to WASB.

Thanks
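
A minimal sketch for double-checking the WASB angle, assuming a spark-shell on the cluster and the path from the question: ask the Hadoop FileSystem API what block size it reports for the file, since the split size is capped at that value.

import org.apache.hadoop.fs.{FileSystem, Path}

// Block size the active file system (WASB here, HDFS locally) reports for the file;
// a larger reported block size means larger splits and therefore fewer partitions.
val fs = FileSystem.get(sc.hadoopConfiguration)
val blockSize = fs.getFileStatus(new Path("/user/aahmed/file.csv")).getBlockSize
println(s"Reported block size: ${blockSize / (1024 * 1024)} MB")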
