Created 04-25-2016 05:01 PM
I am seeing some strange behavior. I have a file stored in Azure WASB (size 1 GB). When I create an RDD with the statement below, it only creates two partitions. I was under the impression that the partitioning should be based on the HDFS block size, which is 128 MB in our environment.
val fileRDD = sc.textFile("/user/aahmed/file.csv")
It seems to create one partition for roughly every 500 MB. I tried it with one large file (28 GB) and got 56 partitions.
It is supposed to be based on the HDFS block size, not on 500 MB.
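For reference, this is a minimal way to check the partition count (same fileRDD as above):

val fileRDD = sc.textFile("/user/aahmed/file.csv")
println(fileRDD.getNumPartitions)    // shows 2 for the 1 GB file
// equivalently, on older Spark versions:
println(fileRDD.partitions.length)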
Created 04-25-2016 05:18 PM
Very curious. The partitions are defined by minPartitions or by the default splits from your InputFormat; for a text file this should be TextInputFormat. Below is the relevant code from HadoopRDD.scala.
So I don't think Spark is doing anything wrong here. Can you try to read the same file in Pig and see how many tasks it creates in that case? It might be a WASB issue.
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}
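To spell out why the block size matters here: for a plain text file, getSplits goes through the old-API FileInputFormat, which picks a split size from the block size the file system reports. A rough sketch of that arithmetic with the numbers from your post (illustration only, not the exact Hadoop code):

// splitSize = max(minSize, min(goalSize, blockSize)) in org.apache.hadoop.mapred.FileInputFormat
val totalSize     = 1L * 1024 * 1024 * 1024            // the 1 GB file
val minPartitions = 2                                  // Spark's defaultMinPartitions is usually 2
val goalSize      = totalSize / minPartitions          // 512 MB
val minSize       = 1L                                 // default split min size
val blockSize     = 512L * 1024 * 1024                 // block size reported by the file system
val splitSize     = math.max(minSize, math.min(goalSize, blockSize))  // 512 MB here
val numSplits     = math.ceil(totalSize.toDouble / splitSize).toInt   // -> 2 partitions
// With a 128 MB block size the same formula gives 128 MB splits, i.e. 8 partitions.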
Created 04-25-2016 10:38 PM
Hi Benjamin,
I tested this locally and on the Hortonworks sandbox. In both places I get the expected behavior: the partitioning is based on the split size. I think it is something related to WASB.
Thanks
Created 04-26-2016 04:51 PM
Actually, the default "block size" for WASB is 512 MB. So that explains it.
"dfs.block.size which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB (set to 512 MB by default), max split size etc."