
RDD not fully persisting


New Contributor

I'm running into an issue where I persist an RDD and call count() to trigger the persistence, but the entire RDD doesn't end up in memory until I've queried it several times. This makes the first few runs extremely slow. Has anyone run into this issue, and if so, how did you fix it?

 

Thank you


Re: RDD not fully persisting

Master Collaborator
If you mean persisting in memory, then one obvious reason is not
having enough memory to cache the RDD fully. If you do have enough
memory, the key point is that the job doesn't wait for partitions to
be cached before it computes the result. If you run it several times,
you should find that each subsequent run finds more of the partitions
cached and so runs faster, but each run completes without waiting for
everything to be serialized into the cache.

I'm not sure that's an issue at all. Your first run necessarily can't
use the cache. It would be slower still if it waited for everything to
cache fully before answering.
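
If it helps to confirm what is happening, here is a minimal sketch (not taken from the original poster's job) that persists an RDD, runs count() a few times, and prints how many partitions Spark reports as cached after each run. The data set, the name "squares", the partition count, and the local master are all placeholders for illustration. It relies on SparkContext.getRDDStorageInfo, which is marked as a developer API, and the figures it reports may trail the actual cache state slightly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheCheck {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration; on a cluster, drop .master(...)
    val spark = SparkSession.builder()
      .appName("cache-check")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in data: 8 partitions of squared numbers
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
      .map(x => x.toLong * x)
      .setName("squares")
      .persist(StorageLevel.MEMORY_ONLY)

    // Run the same action several times and report cache coverage after each
    for (run <- 1 to 3) {
      val start = System.nanoTime()
      val n = rdd.count()
      val elapsedMs = (System.nanoTime() - start) / 1000000

      // getRDDStorageInfo only lists RDDs that have at least one cached block
      sc.getRDDStorageInfo.find(_.id == rdd.id) match {
        case Some(info) =>
          println(s"run $run: count=$n in $elapsedMs ms, " +
            s"cached ${info.numCachedPartitions}/${info.numPartitions} partitions, " +
            s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
        case None =>
          println(s"run $run: count=$n in $elapsedMs ms, no partitions reported as cached yet")
      }
    }

    spark.stop()
  }
}
```

If the cached-partition count stays below the total even after several runs, memory is the more likely culprit, and a storage level that can spill to disk (for example StorageLevel.MEMORY_AND_DISK) may be worth trying. The Storage tab of the Spark UI shows the same fraction-cached figure without any code.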