New Contributor

RDD not fully persisting

I'm running into an issue where I persist an RDD (and call count() to trigger the persistence), but the entire RDD doesn't end up in memory until I have queried it multiple times. This makes the first few runs extremely slow. Has anyone run into this issue, and if so, how did you fix it?


Thank you

Cloudera Employee

Re: RDD not fully persisting

If you mean persisting in memory, then one possible reason is simply
not having enough memory to cache the whole RDD. If you do have enough
memory, the key point is that a job does not wait for partitions to
finish caching before it computes and returns its result. If you run
the job several times, you should find that each subsequent run finds
more of the partitions cached and so runs faster, but each run
completes without waiting for everything to be serialized into the
cache.
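
One way to check this on your own data is to time the same action over several runs and see whether the later runs get faster. Below is a minimal PySpark sketch, not a definitive recipe. The helper names `time_action` and `run_demo`, the local master `local[2]`, and the toy squared-numbers RDD are all assumptions made up for illustration; substitute your own RDD and action.

```python
import time


def time_action(action, runs=3):
    """Call a zero-argument action `runs` times; return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def run_demo():
    """Persist a toy RDD and time repeated count() calls on it.

    Assumes pyspark is installed and a local Spark master can be started.
    """
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[2]", "persist-demo")  # hypothetical app name
    try:
        # Toy data standing in for the real RDD (an assumption).
        rdd = sc.parallelize(range(1000000), 8).map(lambda x: x * x)

        # persist() only marks the RDD for caching; nothing is stored
        # until an action such as count() actually computes partitions.
        rdd.persist(StorageLevel.MEMORY_ONLY)

        for i, seconds in enumerate(time_action(rdd.count, runs=3), start=1):
            print("run %d: %.3f s" % (i, seconds))

        # Confirms the RDD is marked as persisted and at which level;
        # it does not report how many partitions are actually cached.
        print(rdd.is_cached, rdd.getStorageLevel())
    finally:
        sc.stop()
```

Call `run_demo()` from a Python session where pyspark is available. To see how much of the RDD is actually cached after each run, the Storage tab of the Spark web UI lists the cached fraction and the number of cached partitions per persisted RDD.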

I'm not sure that's a problem at all. Your first run necessarily can't
use the cache, and it would be even slower if it waited for the RDD to
cache fully before returning an answer.