I have a sample dataset like below:-
sample=[(201406,'c',100),(201406,'e',200),(201406,'a',300),(201407,'c',100),(201407,'d',300),(201407,'e',500)]
samplerdd=sc.parallelize(sample)
I was able to get top N records per group through groupByKey() like below which gave me below output:-
samplerdd.map(lambda rec:((rec[0]),(rec[2],rec[1]))).groupByKey().map(lambda rec:((rec[0]),sorted(rec[1],reverse=True)[:2])).collect()
[(201406, [(300, 'a'), (200, 'e')]), (201407, [(500, 'e'), (300, 'd')])]
But how to achive same through other Key API's and what should be the best in case of huge data to avoid shuffle operations?.
Does other API's guarantee global ordering?