03-14-2018 09:33 AM
I was going through some pages on Spark best practices and found this one: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_use_count_when...
If I had to check whether a dataframe has at least 10 entries, would it be better to do df.count() >= 10 or df.take(10).length == 10?
I tried both methods and didn't see any measurable difference in performance, so I'm wondering what the reasoning is behind the post I linked?
These are large dataframes that have some fairly complex transformations before the count/take is called, so the count/take action can take a very long time to complete.
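To make the question concrete, here is a pure-Python sketch (an analogy, not actual Spark code) of the principle the linked post is about: a full count() has to materialize every row, while a take(n)-style check can stop as soon as n rows have been seen. The expensive_rows generator below is a hypothetical stand-in for a dataframe whose rows are costly to compute.

```python
# Analogy only: why take(n) can be cheaper than count() for an
# "at least N rows?" check. count() consumes the whole stream;
# islice (like take) stops after the first n elements.
from itertools import islice

def expensive_rows(n_total, log):
    """Hypothetical stand-in for a dataframe whose rows are costly to produce."""
    for i in range(n_total):
        log.append(i)  # record how many rows were actually materialized
        yield i

# Full count: forces all 1000 rows.
log_count = []
total = sum(1 for _ in expensive_rows(1000, log_count))
has_10_via_count = total >= 10        # True, after producing 1000 rows

# Early-exit take: stops after 10 rows.
log_take = []
first_10 = list(islice(expensive_rows(1000, log_take), 10))
has_10_via_take = len(first_10) == 10  # True, after producing only 10 rows

print(has_10_via_count, len(log_count))  # True 1000
print(has_10_via_take, len(log_take))    # True 10
```

In real Spark the picture is murkier: take(n) can short-circuit by scanning only as many partitions as needed, but if the upstream transformations involve wide operations (shuffles, aggregations), most of the work may have to happen either way, which could explain seeing no difference in practice.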