Sub Partitioning an RDD in pyspark

In PySpark, using mapPartitionsWithIndex I can iterate over each partition of an RDD and apply an action. However, I do not want to increase the number of partitions beyond 10; instead, I want to use nested partitioning:

rdd = myfile.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
for part_id in range(rdd.getNumPartitions()):
    # generator_func(part_id) is meant to keep only the rows of partition part_id
    part_rdd = rdd.mapPartitionsWithIndex(generator_func(part_id))
    # partition part_rdd further into 10 partitions; pass the function itself, not its result
    part_rdd = part_rdd.partitionBy(10, partition_func)
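For context, here is a minimal sketch of what I have in mind for generator_func (its exact behavior is my assumption): it returns a function with the (index, iterator) signature that mapPartitionsWithIndex expects, yielding the rows of one partition and discarding the rest.

def generator_func(part_id):
    # Assumed helper: keep only the rows that live in partition part_id.
    def keep_only(index, iterator):
        return iterator if index == part_id else iter([])
    return keep_only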

I am not sure what partition_func() should do. I tried lambda x: x, but it didn't work. Any suggestions on what I should do here?
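For reference, here is a minimal, runnable sketch of how I understand partitionBy (the sample data and the ord-based key function are illustrative assumptions, and I am assuming string keys): the second argument must map a key to an integer, which Spark then takes modulo the partition count. That would explain why lambda x: x fails for string keys, since something like "a" % 10 raises a TypeError.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Small word-count pair RDD, analogous to the snippet above.
pairs = sc.parallelize(["a", "b", "a", "c"]) \
          .map(lambda x: (x, 1)) \
          .reduceByKey(lambda a, b: a + b)

# Default partitioner (portable_hash) hashes the key:
sub = pairs.partitionBy(10)

# Or a custom key-to-integer function, assuming string keys:
sub = pairs.partitionBy(10, lambda key: sum(ord(c) for c in key))

print(sub.getNumPartitions())  # 10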