
Spark RDD partitions behavior in HDInsight (Azure)


New Contributor

I am seeing strange behavior. I have a file stored in Azure WASB (size 1 GB), and when I create an RDD with the statement below, it only creates two partitions. I was under the impression it should be based on the HDFS block size, which is 128 MB in our environment.

val fileRDD = sc.textFile("/user/aahmed/file.csv")

It seems like it creates one partition per 500 MB. I tried it with a larger file (28 GB) and got 56 partitions.

It is supposed to be based on the HDFS block size, not on 500 MB.
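
(For illustration, a minimal spark-shell sketch of how the partition count can be checked; the second argument to textFile is the optional minPartitions hint, and the path and values are just examples.)

val fileRDD = sc.textFile("/user/aahmed/file.csv")
println(fileRDD.partitions.length)                     // prints 2 for the 1 GB file

// Asking for more splits via the minPartitions hint (should give ~8 here):
val fileRDD8 = sc.textFile("/user/aahmed/file.csv", 8)
println(fileRDD8.partitions.length)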

1 ACCEPTED SOLUTION


Re: Spark RDD partitions behavior in HDInsight (Azure)

@Adnan Ahmed

Actually, the default "block size" for WASB is 512 MB, so that explains it.

http://i1.blogs.msdn.com/b/bigdatasupport/archive/2015/02/17/sqoop-job-performance-tuning-in-hdinsig...

"dfs.block.size which is represented by fs.azure.block.size in Windows Azure Storage Blob, WASB (set to 512 MB by default), max split size etc."


REPLIES

Re: Spark RDD partitions behavior in HDInsight (Azure)

Very curious. The partitions are defined by minPartitions or by the default splits from your InputFormat; for a text file this should be TextInputFormat. Below is the relevant code from HadoopRDD.scala.

So I don't think Spark is doing anything wrong. Can you try reading the same file in Pig and see how many tasks it creates in that case? It might be a WASB issue.

 override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
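
For context, my understanding of how the old-API FileInputFormat sizes those splits (a rough sketch, worth double-checking against your Hadoop version) is roughly:

// Not the literal Hadoop source, just the shape of the computation:
def splitSize(totalSize: Long, numSplits: Int, blockSize: Long, minSize: Long): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)   // driven by minPartitions
  math.max(minSize, math.min(goalSize, blockSize))    // capped at the reported block size
}

// If WASB reports a 512 MB block size: 1 GB file with default minPartitions = 2
// gives goalSize = 512 MB, splitSize = 512 MB -> 2 partitions; 28 GB -> 56 partitions.

So it is the block size reported by the filesystem, not HDFS's 128 MB, that caps the split size.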

Re: Spark RDD partitions behavior in HDInsight (Azure)

New Contributor

Hi Benjamin,

I tested this on my local machine and on the Hortonworks sandbox. In both places I get the expected behavior: the partitions are based on the split size. I think it is something specific to WASB.
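
(A quick way to compare environments is to print the block size the filesystem reports for the file; a sketch, with the path as an example.)

import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.getFileStatus(new Path("/user/aahmed/file.csv"))
println(status.getBlockSize)   // ~128 MB on HDFS here vs. ~512 MB on WASB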

Thanks

