I have a sample dataset like the one below:
I was able to get the top N records per group with groupByKey(), as shown below, which gave me the following output:
samplerdd.map(lambda rec:((rec),(rec,rec))).groupByKey().map(lambda rec:(rec[0],sorted(rec[1],reverse=True)[:2])).collect()
[(201406, [(300, 'a'), (200, 'e')]), (201407, [(500, 'e'), (300, 'd')])]
But how can I achieve the same result with the other by-key APIs, and which one would be best for huge data in order to minimize shuffling?
Do the other APIs guarantee global ordering?
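For reference, here is a rough sketch of the kind of alternative I have in mind, using aggregateByKey() with a bounded top-N list per key. It assumes the RDD has already been mapped to (key, (amount, label)) pairs; the names N, add_record, merge_tops and top_n_per_key, as well as the sample records, are made up for illustration. The two "partitions" are simulated in plain Python so the logic can be checked without a cluster.

```python
import heapq

N = 2  # how many records to keep per key (hypothetical setting)

def add_record(top, value):
    """Fold one value into a partition-local top-N list (sorted descending)."""
    return heapq.nlargest(N, top + [value])

def merge_tops(left, right):
    """Merge two partition-local top-N lists into a single top-N list."""
    return heapq.nlargest(N, left + right)

def top_n_per_key(pairs_rdd):
    """Assumes pairs_rdd holds (key, (amount, label)) pairs.

    aggregateByKey combines on the map side, so at most N values per key
    leave each partition, instead of every value as with groupByKey.
    """
    return pairs_rdd.aggregateByKey([], add_record, merge_tops)

# Local check without Spark: simulate two partitions by hand.
# These records are invented sample data, not the real dataset.
partition_a = [(201406, (300, 'a')), (201406, (100, 'b')), (201407, (300, 'd'))]
partition_b = [(201406, (200, 'e')), (201407, (500, 'e')), (201407, (50, 'c'))]

def fold_partition(records):
    """Apply add_record within one simulated partition."""
    acc = {}
    for key, value in records:
        acc[key] = add_record(acc.get(key, []), value)
    return acc

a = fold_partition(partition_a)
b = fold_partition(partition_b)
result = {k: merge_tops(a.get(k, []), b.get(k, [])) for k in set(a) | set(b)}
print(sorted(result.items()))
```

As far as I understand, this still shuffles, but only up to N values per key per partition. Each per-key list comes back sorted in descending order, while the order of the keys themselves in collect() is not guaranteed, so a sortByKey() would be needed if the keys must be ordered too.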