
Take vs Count performance


I was going through some pages on Spark best practices and found this one: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_use_count_when...
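As far as I can tell from the title, the page argues against using count() when you only need to know whether any rows exist at all. My understanding of that pattern looks roughly like this (my own sketch, not code copied from the page; the DataFrame here is just a stand-in for mine):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("TakeVsCount").getOrCreate()
    import spark.implicits._

    // Stand-in DataFrame; in my case it is the result of several heavy transformations.
    val df = Seq(1, 2, 3).toDF("value")

    // Emptiness check via count(): computes the total row count just to compare it with 0.
    val isEmptyViaCount = df.count() == 0

    // Emptiness check via take(1): only needs to find one row, so Spark can stop early.
    val isEmptyViaTake = df.take(1).isEmpty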


If I have to check whether a dataframe has at least 10 entries, would it be better to do df.count() >= 10 or df.take(10).length == 10?
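Concretely, the two versions I am comparing look roughly like this (reusing the df from the sketch above; variable names are just for illustration):

    // Version 1: count everything, then compare. Every partition has to be
    // processed to produce the total before the comparison can happen.
    val hasAtLeastTenViaCount = df.count() >= 10

    // Version 2: fetch at most 10 rows to the driver. If exactly 10 come back,
    // the DataFrame has at least 10 rows; take() scans partitions incrementally,
    // so it may be able to stop before touching all of them.
    val hasAtLeastTenViaTake = df.take(10).length == 10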


I tried both methods and didn't see a difference in performance, so I'm wondering about the reasoning behind the post I linked.
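For the comparison I simply wall-clock-timed each action, along these lines (timed is my own helper, not a Spark API, and the numbers vary with caching and cluster load):

    // Crude wall-clock timer around a single Spark action.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - start) / 1e6}%.1f ms")
      result
    }

    timed("count-based check") { df.count() >= 10 }
    timed("take-based check") { df.take(10).length == 10 }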


These are large dataframes that have some fairly complex transformations before the count/take is called, so the count/take action can take a very long time to complete.