Support Questions
Find answers, ask questions, and share your expertise

SPARK number of partitions/tasks while reading a file

Could someone tell me the answer to the question below, and explain why and how?

Q. How many partitions shall "initially" be created with the following command on the Spark shell?

sc.textFile("hdfs://user/cloudera/csvfiles")

There are 100 files in the directory /user/cloudera/csvfiles and there are 10 nodes running Spark.
a. 1  b. 10  c. 20  d. 100


Re: SPARK number of partitions/tasks while reading a file

@Sandeep Ahuja,

textFile() partitions the data based on the number of HDFS blocks the input occupies. If the file fits in a single block, the RDD is initialized with a minimum of 2 partitions. If you want to increase the minimum number of partitions, you can pass the minPartitions argument as shown below:

files = sc.textFile("hdfs://user/cloudera/csvfiles", minPartitions=10)

If you want to check the number of partitions, you can run the following statement:

files.getNumPartitions()
Note: If you set minPartitions to less than the number of HDFS blocks, Spark silently uses the number of HDFS blocks as the partition count instead; it does not raise an error.
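The rule above can be summarized as: the initial partition count is the larger of minPartitions and the number of input splits (roughly one split per HDFS block for plain text files). Here is a minimal sketch of that relationship in plain Python; initial_partitions is a hypothetical helper for illustration, not a Spark API, and it assumes one split per block (which does not hold for non-splittable compressed files):

```python
# Sketch (not Spark itself) of how the initial partition count for
# sc.textFile(path, minPartitions) relates to the HDFS block count.
# Assumption: one input split per HDFS block (plain, uncompressed text).

def initial_partitions(num_hdfs_blocks: int, min_partitions: int = 2) -> int:
    """Spark never creates fewer partitions than there are input splits,
    so the effective count is the larger of the two values."""
    return max(num_hdfs_blocks, min_partitions)

# A single-block file still yields the default minimum of 2 partitions.
print(initial_partitions(1))          # 2
# 100 single-block files -> 100 splits -> 100 partitions, regardless of
# how many nodes run Spark (answer d in the quiz above).
print(initial_partitions(100))        # 100
# Setting minPartitions above the block count raises the split count.
print(initial_partitions(100, 150))   # 150
```

This also illustrates the note: asking for minPartitions=10 on a 100-block directory has no effect, because the block count already exceeds it.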


Please "Accept" the answer if this helps, or reply back with any questions.


