
Take vs Count performance

I was reading about Spark best practices and found this page: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_use_count_when...


If I had to check whether a dataframe has at least 10 entries, would it be better to do df.count() >= 10 or df.take(10).length == 10?


I tried both methods and didn't see a noticeable difference in performance, so I'm wondering what the logic is behind the post I linked?


These are large dataframes that have some fairly complex transformations before the count/take is called, so the count/take action can take a very long time to complete.
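For what it's worth, the intuition behind the linked advice can be sketched in plain Python (not Spark itself; the generator and counter below are just stand-ins for a lazily evaluated dataframe): counting forces every row to be materialized, while taking the first 10 can stop as soon as 10 rows have been produced.

```python
from itertools import islice

def make_rows(n, counter):
    # Generator standing in for a lazily computed dataframe.
    # counter[0] records how many rows are actually materialized.
    for i in range(n):
        counter[0] += 1
        yield i

N = 1_000_000

# count()-style check: every row must be materialized
count_cost = [0]
total = sum(1 for _ in make_rows(N, count_cost))
has_10_via_count = total >= 10          # True
rows_scanned_for_count = count_cost[0]  # 1_000_000

# take(10)-style check: stops after producing 10 rows
take_cost = [0]
taken = list(islice(make_rows(N, take_cost), 10))
has_10_via_take = len(taken) == 10      # True
rows_scanned_for_take = take_cost[0]    # 10
```

The caveat, which may explain what you observed: this early exit only helps when producing the first 10 rows is cheap. If the upstream transformations force the whole input to be computed anyway (e.g. a wide shuffle or aggregation), both checks end up paying roughly the same cost.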
