Apart from specifying the number of partitions when creating a DataFrame or using coalesce/repartition, is there any configuration parameter we can change so that the default number of partitions after a shuffle (200) is reduced?
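For context, the approaches I already know about look roughly like this (a minimal PySpark sketch; the app name and partition counts are just illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

# Specifying the partition count up front, at creation time
rdd = sc.parallelize(range(1000), 8)   # RDD created with 8 partitions
print(rdd.getNumPartitions())          # 8

# Changing it afterwards on a DataFrame
df = spark.range(1000)
df_more = df.repartition(8)            # full shuffle into 8 partitions
df_less = df.coalesce(4)               # merge down to 4 partitions
```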
@Dinesh Chitlangia, could you help me with this?
Spark provides multiple APIs to achieve repartitioning (a sketch of all three is shown below):

1. RDD.groupBy(f, numPartitions)
2. RDD.repartition(numPartitions)
3. RDD.coalesce(numPartitions)

If you want to decrease the number of partitions in the initial RDD (the one created when reading the input), try writing a custom CombineFileInputFormat, which packs multiple input splits into each partition and so produces fewer partitions (tasks).
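A minimal PySpark sketch of the three RDD APIs above (the grouping key and partition counts are illustrative):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd = sc.parallelize(range(100), 10)
print(rdd.getNumPartitions())                # 10

# 1. groupBy: pass numPartitions to control the result's partition count
grouped = rdd.groupBy(lambda x: x % 3, numPartitions=3)

# 2. repartition: full shuffle, can increase or decrease the count
wider = rdd.repartition(20)

# 3. coalesce: merges existing partitions, meant for decreasing the count
narrower = rdd.coalesce(4)

print(grouped.getNumPartitions(),            # 3
      wider.getNumPartitions(),              # 20
      narrower.getNumPartitions())           # 4
```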
* coalesce is more efficient than repartition when reducing the number of partitions, because it avoids the full shuffle of data that repartition always performs.
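A small sketch to illustrate the difference (assuming a live SparkContext `sc` as above; under the hood, `repartition(n)` is implemented as `coalesce(n, shuffle=True)`):

```python
rdd = sc.parallelize(range(100), 10)

# coalesce(5) with the default shuffle=False merges partitions locally;
# the job runs as a single stage with no shuffle read/write in the UI.
rdd.coalesce(5).count()

# repartition(5) always performs a full shuffle; the same job now shows
# an extra stage plus shuffle read/write in the Spark UI.
rdd.repartition(5).count()
```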