
How to find top N records per group using pyspark RDD [not by dataframe API]

Contributor

I have a sample dataset like the one below:

sample=[(201406,'c',100),(201406,'e',200),(201406,'a',300),(201407,'c',100),(201407,'d',300),(201407,'e',500)]
samplerdd=sc.parallelize(sample)

I was able to get the top N records per group through groupByKey(), which gave me the output below:

samplerdd.map(lambda rec: (rec[0], (rec[2], rec[1]))) \
         .groupByKey() \
         .map(lambda rec: (rec[0], sorted(rec[1], reverse=True)[:2])) \
         .collect()
[(201406, [(300, 'a'), (200, 'e')]), (201407, [(500, 'e'), (300, 'd')])]

But how can I achieve the same result through other key-based APIs, and which is best for huge data in order to avoid expensive shuffle operations?

Do the other APIs guarantee global ordering?

1 REPLY

Expert Contributor
@Mr. Davy Jones

I found this article useful for solving the above problem efficiently: https://ragrawal.wordpress.com/2015/08/25/pyspark-top-n-records-in-each-group/
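The general idea behind avoiding groupByKey() for this problem is to use aggregateByKey(), which combines records per partition before the shuffle, so only at most N records per key per partition cross the network instead of every record. Below is a minimal sketch of that approach (my own illustration, not necessarily identical to the article's code); the combiner functions are plain Python so they can be tested without a cluster, and `sc`/`samplerdd` are assumed to be the SparkContext and RDD from the question.

```python
import heapq

N = 2  # keep the top N records per key

def add_to_top_n(acc, rec):
    # Seq function: fold one (value, letter) record into the bounded
    # accumulator, keeping only the N largest records seen so far.
    acc.append(rec)
    return heapq.nlargest(N, acc)

def merge_top_n(acc1, acc2):
    # Comb function: merge two partial top-N lists from different
    # partitions into a single top-N list.
    return heapq.nlargest(N, acc1 + acc2)

# With a live SparkContext `sc` and the RDD from the question, this
# would replace the groupByKey() version:
#
# result = (samplerdd
#           .map(lambda rec: (rec[0], (rec[2], rec[1])))
#           .aggregateByKey([], add_to_top_n, merge_top_n)
#           .collect())
#
# Each partition ships at most N records per key to the reducers,
# instead of the full group that groupByKey() would shuffle.
```

Note that no RDD API gives you a global ordering for free here; the ordering is per key, established by the combiner functions themselves.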