Support Questions


SPARK number of partitions/tasks while reading a file


Could someone tell me the answer to the question below, along with the why and how?

Q. How many partitions will initially be created by the following command in the Spark shell?

sc.textFile("hdfs://user/cloudera/csvfiles")

There are 100 files in the directory /user/cloudera/csvfiles, and there are 10 nodes running Spark.

a. 1 b. 10 c. 20 d. 100


Super Guru

@Sandeep Ahuja,

textFile() creates partitions based on the number of HDFS blocks the input occupies. If the file is only one block, the RDD is still initialized with a minimum of 2 partitions. If you want to raise the minimum number of partitions, you can pass an argument for it as shown below:

files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)

If you want to check the number of partitions, you can run the statement below:

files.getNumPartitions()
Note: If you set minPartitions to less than the number of HDFS blocks, Spark will automatically use the number of HDFS blocks as the partition count; it does not raise an error.
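As a rough sketch of the rule above (plain Python, not Spark; the function name and block counts are hypothetical, for illustration only), the initial partition count can be modeled as the larger of the total HDFS block count and minPartitions:

```python
def initial_partitions(blocks_per_file, min_partitions=2):
    """Simplified model of Spark's initial partition count:
    roughly one partition per HDFS block, but never fewer
    than min_partitions."""
    total_blocks = sum(blocks_per_file)
    return max(total_blocks, min_partitions)

# 100 single-block files, as in the quiz question -> 100 partitions (answer d)
print(initial_partitions([1] * 100))   # 100

# A single 1-block file still gets the default minimum of 2 partitions
print(initial_partitions([1]))         # 2

# minPartitions below the block count is effectively ignored, per the note above
print(initial_partitions([1] * 20, min_partitions=10))  # 20
```

This is only a mental model for the default case; the real split computation is done by the underlying Hadoop InputFormat, which can also split large blocks further when minPartitions is higher than the block count.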


Please "Accept" the answer if this helps, or reply back with any questions.