Could someone tell me the answer to the question below, and why?
Q. How many partitions will initially be created by the following command in the Spark shell: sc.textFile("hdfs://user/cloudera/csvfiles")? There are 100 files in the directory /user/cloudera/csvfiles and there are 10 nodes running Spark. a. 1 b. 10 c. 20 d. 100
textFile() partitions the input based on the number of HDFS blocks the files occupy. Since the directory contains 100 files and each file occupies at least one block, 100 partitions are created initially, so the answer is d. 100. (The number of nodes does not affect the initial partition count.) If the input is only a single block, the RDD is initialized with a minimum of 2 partitions. If you want to raise the minimum number of partitions, you can pass an argument for it like below:
files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)
If you want to check the number of partitions, you can run the statement below:
files.getNumPartitions()
Note: If you set minPartitions to less than the number of HDFS blocks, Spark will still create one partition per block (minPartitions is a lower bound, not an exact count) and does not raise an error.
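To make the arithmetic behind the answer concrete, here is a small plain-Python sketch of the rule described above: each file contributes at least one partition per HDFS block, and minPartitions only acts as a lower bound. This is a simplification for illustration, not Spark's exact FileInputFormat split logic (names like estimate_partitions are made up for this example).

```python
import math

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in bytes

def estimate_partitions(file_sizes, min_partitions=2):
    """Simplified model of sc.textFile's initial partition count.

    Each file contributes at least one partition (one per HDFS block),
    so the total never drops below the total block count, even when
    min_partitions is smaller.
    """
    blocks = sum(max(1, math.ceil(size / HDFS_BLOCK_SIZE))
                 for size in file_sizes)
    return max(blocks, min_partitions)

# The scenario from the question: 100 small files, each within one block.
print(estimate_partitions([1024] * 100))                     # -> 100
# minPartitions below the block count is effectively ignored:
print(estimate_partitions([1024] * 100, min_partitions=10))  # -> 100
# A single one-block file still yields the 2-partition minimum:
print(estimate_partitions([1024]))                           # -> 2
```

The number of worker nodes (10 in the question) never enters this calculation, which is why options a, b, and c are wrong.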
Please "Accept" the answer if this helps, or reply back with any questions.