
dataframe count - without re-running it.

Explorer

Hi,

 

How do I get a DataFrame's record count without re-running the DataFrame? In other words, can we pull this information from some Spark stats table?

 

A few options I am aware of are (a rough sketch of these follows the list):

1. dataframe.cache() -- I don't want to store the result in memory.

2. dataframe.describe("col").show() -- again, it will re-run the DataFrame to get the count.

3. dataframe.count() (count() returns a Long, not a DataFrame) -- again, it will re-run the DataFrame to get the count.
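For reference, a minimal Scala sketch of the three options above, assuming Spark 2.x; the SparkSession, app name, DataFrame df, and column name are illustrative, not from the original question:

```scala
import org.apache.spark.sql.SparkSession

object CountOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("count-options").getOrCreate()

    // Illustrative DataFrame; in the question it comes from an upstream pipeline.
    val df = spark.range(0, 1000000).toDF("col")

    // Option 1: cache() keeps the computed data in memory so later actions
    // reuse it, but at the memory cost the question wants to avoid.
    df.cache()

    // Option 2: describe() launches a new aggregation job over df.
    df.describe("col").show()

    // Option 3: count() is an action; without cache()/persist() it re-runs
    // the full lineage of df. It returns a Long, not a DataFrame.
    val n: Long = df.count()
    println(s"row count = $n")

    spark.stop()
  }
}
```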

 

Thanks!

 

Thanks,
Waseem

 

2 Replies

Re: dataframe count - without re-running it.

Expert Contributor

Can you try dataframe.persist(DISK_ONLY)? This stores the result on disk instead of recomputing it.
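A minimal sketch of that suggestion in Scala, assuming Spark 2.x; the DataFrame and app name are illustrative, and DISK_ONLY comes from org.apache.spark.storage.StorageLevel:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistDiskOnly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-disk-only").getOrCreate()
    val df = spark.range(0, 1000000).toDF("col")   // illustrative DataFrame

    // DISK_ONLY materializes the computed partitions on local disk instead of
    // memory; the first action populates it, and later actions read it back
    // rather than re-running the lineage.
    df.persist(StorageLevel.DISK_ONLY)

    val n = df.count()   // first action: computes df once and spills it to disk
    df.show(10)          // reuses the persisted partitions, no recomputation
    println(s"row count = $n")

    df.unpersist()
    spark.stop()
  }
}
```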


Re: dataframe count - without re-running it.

Explorer

Hi,

Thanks for the reply, but that will spend time writing the DataFrame to disk. I am looking for something like a way to have the DataFrame count written to the logs, via a Spark logger option.

 

The intention is to avoid re-running the DataFrame.
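One pattern that matches what is being asked for here, though it is not mentioned in the thread, is a Spark LongAccumulator: each task bumps a shared counter as a side effect of an action that is already being run (for example a write), and the total is then logged. This is a sketch only, assuming Scala, Spark 2.x, and log4j (bundled with Spark 2.x); the DataFrame, output path, and object name are illustrative:

```scala
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object CountViaAccumulator {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("count-via-accumulator").getOrCreate()
    val log = Logger.getLogger(getClass.getName)

    val df = spark.range(0, 1000000).toDF("col")   // illustrative DataFrame

    // A LongAccumulator lets every task add to a shared counter as a side
    // effect of an action that runs anyway, so no separate count() pass is needed.
    val rowCount = spark.sparkContext.longAccumulator("rowCount")

    // Thread the rows through a map that increments the accumulator, then run
    // the action you were going to run regardless (here: a Parquet write).
    val countedRdd = df.rdd.map { row => rowCount.add(1L); row }
    val counted = spark.createDataFrame(countedRdd, df.schema)
    counted.write.mode("overwrite").parquet("/tmp/count_via_accumulator")  // hypothetical path

    // Accumulator values are only reliable after the action finishes, and task
    // retries can over-count, so treat this as an operational count.
    log.info(s"row count = ${rowCount.value}")

    spark.stop()
  }
}
```

Note that going through df.rdd bypasses some Catalyst optimizations, so this trades a little per-row overhead for avoiding a second full pass over the data.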
