Take vs Count performance

I was going through some pages for Spark practices and found this page:


If I need to check whether a DataFrame has at least 10 rows, would it be better to use df.count() >= 10 or df.take(10).length >= 10?
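For concreteness, here is a minimal Scala sketch of the two checks being compared. The object name `AtLeastTenCheck`, the helpers `viaCount` and `viaTake`, and the small `spark.range` test data are hypothetical names made up for illustration; the real DataFrames are much larger and come out of a longer pipeline.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object AtLeastTenCheck {

  // count() is an action that has to count every row of the DataFrame.
  def viaCount(df: DataFrame, n: Int): Boolean =
    df.count() >= n

  // take(n) returns at most n rows as an Array[Row], so the check
  // only needs to know whether n rows came back.
  def viaTake(df: DataFrame, n: Int): Boolean =
    df.take(n).length >= n

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and toy data stand in for the real job.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("take-vs-count")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id")

    println(viaCount(df, 10)) // true
    println(viaTake(df, 10))  // true

    spark.stop()
  }
}
```

Both helpers return the same answer; the question is only about how much work Spark does to get there.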


I tried both methods and saw no difference in performance, so I'm wondering what the reasoning behind the post I linked is.


These are large DataFrames with some fairly complex transformations applied before count/take is called, so either action can take a very long time to complete.