
Sub Partitioning an RDD in pyspark

In PySpark, using mapPartitionsWithIndex I can iterate over each partition of an RDD and apply an action, but I do not want to increase the number of partitions beyond 10; instead, I want to use nested partitioning.

rdd = myfile.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

for part_id in range(rdd.getNumPartitions()):
    # generator_func(part_id) is a placeholder that returns an
    # f(index, iterator) suitable for mapPartitionsWithIndex
    part_rdd = rdd.mapPartitionsWithIndex(generator_func(part_id))
    # partition part_rdd further into 10 partitions:
    part_rdd = part_rdd.partitionBy(10, partition_func)

I am not sure what to do in partition_func(). I thought of using lambda x: x, but it didn't work. Any suggestions on what I should do here?
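For reference, this is roughly how I understand partitionBy's partition function is supposed to behave on a (key, value) RDD like the word counts above: it receives the key and must return an integer, which Spark takes modulo the number of partitions. portable_hash is just the default partitioner I found in pyspark.rdd, and the sample data here is made up; I am not sure this is the right approach for the nested case I want:

from pyspark import SparkContext
from pyspark.rdd import portable_hash

sc = SparkContext.getOrCreate()

# Toy word-count pair RDD, same shape as in my code above
words = sc.parallelize(["spark", "rdd", "spark", "partition", "rdd", "spark"])
rdd = words.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

# partitionBy only works on (key, value) pairs; the partition function
# receives the key and must return an int, which Spark reduces modulo
# numPartitions to pick the target partition. lambda x: x fails here
# because the key is a string, not an int.
# (On a cluster, PYTHONHASHSEED must be set consistently for portable_hash.)
sub_rdd = rdd.partitionBy(10, lambda key: portable_hash(key))

print(sub_rdd.getNumPartitions())  # 10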
