
SPARK number of partitions/tasks while reading a file


Could someone tell me the answer to the question below, and explain why and how?

Q. How many partitions shall "initially" be created with the following command on the Spark shell: sc.textFile("hdfs://user/cloudera/csvfiles")? There are 100 files in the directory /user/cloudera/csvfiles and there are 10 nodes running Spark.
a. 1
b. 10
c. 20
d. 100


@Sandeep Ahuja,

textFile() creates partitions based on the number of HDFS blocks the input occupies, with at least one partition per file. Since the directory contains 100 files, at least 100 partitions are created initially, so the answer is d. 100 (more if any file spans multiple blocks). If the input is a single file occupying only one block, the RDD is still initialized with a minimum of 2 partitions. If you want to raise the minimum number of partitions, you can pass an argument for it as below:

files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)

If you want to check the number of partitions, you can run the statement below:

files.getNumPartitions()

Note: if you set minPartitions to fewer than the number of HDFS blocks, Spark silently uses the number of HDFS blocks instead; it does not raise an error.
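To make the block/minPartitions interaction concrete, here is a minimal sketch (plain Python, no Spark needed) of how the initial split count is derived. This is a simplified model of Hadoop's FileInputFormat split logic that textFile() relies on, not Spark's actual code, and the file sizes and block size below are made-up assumptions:

```python
import math

def estimate_partitions(file_sizes, block_size, min_partitions=2):
    """Rough estimate of the initial partition count for sc.textFile().

    Simplified model: each file is cut into splits of at most split_size
    bytes, where split_size = min(block_size, total_size / min_partitions),
    and every file contributes at least one split (partition).
    """
    total = sum(file_sizes)
    goal_size = max(1, total // max(1, min_partitions))
    split_size = min(block_size, goal_size)
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)

MB = 1024 * 1024

# 100 small files in a directory: one partition per file
print(estimate_partitions([1 * MB] * 100, block_size=128 * MB))   # -> 100

# One 300 MB file with a 128 MB HDFS block size: 3 blocks, 3 partitions
print(estimate_partitions([300 * MB], block_size=128 * MB))       # -> 3
```

Note how asking for fewer minPartitions than there are blocks changes nothing: split_size is capped at the block size, so the block count wins, matching the behavior described above.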


Please "Accept" the answer if this helps, or reply back with any questions.