
How to find top N records per group using pyspark RDD [not by dataframe API]


I have a sample dataset like below:


I was able to get the top N records per group through groupByKey() like below:

rdd.map(lambda rec: (rec[0], (rec[2], rec[1]))) \
   .groupByKey() \
   .map(lambda rec: (rec[0], sorted(rec[1], reverse=True)[:2])) \
   .collect()

which gave me this output:

[(201406, [(300, 'a'), (200, 'e')]), (201407, [(500, 'e'), (300, 'd')])]

But how can I achieve the same through other key-based APIs, and which would be best for huge data, to avoid shuffling entire groups?

Do the other APIs guarantee global ordering?
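One way to avoid shuffling whole groups is aggregateByKey, which pre-aggregates on each partition before the shuffle, so at most N records per key per partition cross the network. Below is a minimal sketch, assuming records shaped (month, letter, value) as in the question; the helper names (add_to_top, merge_tops, top_n_per_group) are mine, not from any library:

```python
import heapq

N = 2  # number of records to keep per group

def add_to_top(acc, v):
    # Per-partition combiner: keep at most the N largest values seen so
    # far, so the shuffle only moves N records per key per partition.
    acc.append(v)
    return heapq.nlargest(N, acc)

def merge_tops(a, b):
    # Merge the bounded top-N lists coming from different partitions.
    return heapq.nlargest(N, a + b)

def top_n_per_group(rdd):
    # rdd holds (month, letter, value) records as in the question;
    # key by month and rank on (value, letter).
    return (rdd.map(lambda rec: (rec[0], (rec[2], rec[1])))
               .aggregateByKey([], add_to_top, merge_tops))

# Usage (sc is an existing SparkContext):
# data = [(201406, 'a', 300), (201406, 'e', 200), (201406, 'b', 100),
#         (201407, 'e', 500), (201407, 'd', 300), (201407, 'c', 100)]
# top_n_per_group(sc.parallelize(data)).collect()
```

Note that no RDD API gives you global ordering for free here; ordering is only established within each key's bounded list by the combiners above.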


Expert Contributor
@Mr. Davy Jones

I found this article useful for solving the above problem in an efficient way.
