
Spark: number of partitions/tasks when reading a file

Explorer

Could someone tell me the answer to the question below, and explain why and how?

Q. How many partitions will initially be created by the following command in the Spark shell, given that there are 100 files in the directory /user/cloudera/csvfiles and 10 nodes running Spark?

sc.textFile("hdfs://user/cloudera/csvfiles")

a. 1
b. 10
c. 20
d. 100

1 REPLY

Super Guru

@Sandeep Ahuja,

textFile() creates one partition per HDFS block of the input. If the input is only one block, the RDD is still initialized with a minimum of 2 partitions. In your question the directory holds 100 files, so assuming each file fits within a single HDFS block, 100 partitions are created initially and the answer is d. 100. If you want a higher minimum number of partitions, you can pass it as an argument like below:

files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)

If you want to check the number of partitions, you can run the statement below:

files.getNumPartitions()

Note: if you set minPartitions to a value lower than the number of HDFS blocks, Spark automatically uses the number of HDFS blocks as the partition count instead, without raising any error.
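
To make this concrete, here is a minimal PySpark sketch that walks through the scenario from the question. The directory path, file layout (100 small CSV files, each within one HDFS block), and the partition counts in the comments are assumptions for illustration. In the pyspark shell, sc already exists; a standalone script has to create it:

from pyspark import SparkContext

sc = SparkContext(appName="partition-check")

# One partition per HDFS block: 100 single-block files -> 100 partitions.
files = sc.textFile("hdfs://user/cloudera/csvfiles")
print(files.getNumPartitions())  # expect 100 under the assumptions above

# minPartitions below the block count is ignored; the block count wins.
files10 = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)
print(files10.getNumPartitions())  # still 100, no error raised

# minPartitions above the block count asks Hadoop for more input splits,
# so the count can grow for splittable files (exact result not guaranteed).
files200 = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=200)
print(files200.getNumPartitions())

sc.stop()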

Please "Accept" the answer if this helps, or reply back with any questions.

-Aditya