Support Questions

Take vs Count performance


New Contributor

I was reading through some Spark best-practice material and found this page: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_use_count_when...

 

If I have to check whether a dataframe has at least 10 entries, is it better to do df.count() >= 10 or df.take(10).length >= 10?

 

I tried both methods and didn't see a difference in performance, so I'm wondering about the reasoning behind the post I linked.

 

These are large dataframes with some fairly complex transformations applied before the count/take is called, so the action can take a very long time to complete.
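For what it's worth, the intuition behind the linked advice can be sketched in plain Python, with no Spark involved; here a generator stands in for a lazily computed dataframe. A count()-style check has to consume every row before comparing, while a take(10)-style check can stop after pulling 10 rows:

```python
from itertools import islice

def count_all(rows):
    # count()-style check: consumes the entire input before comparing
    return sum(1 for _ in rows) >= 10

def take_10(rows):
    # take(10)-style check: pulls at most 10 elements, then stops
    return len(list(islice(rows, 10))) >= 10

def rows(n):
    # stands in for rows produced by expensive transformations
    for i in range(n):
        yield i

print(count_all(rows(1_000_000)))  # True, but iterates all 1,000,000 rows
print(take_10(rows(1_000_000)))    # True, after pulling only 10 rows
```

In real Spark, though, whether take actually short-circuits depends on the query plan: with narrow transformations, take(10) can get away with computing only a partition or two, but wide transformations (joins, aggregations, sorts) may force most of the upstream work either way, which could explain seeing no difference in practice.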