Support Questions
Find answers, ask questions, and share your expertise

Spark DataFrame Cache

I am having an issue with caching a DataFrame in Spark. I think I am missing some piece needed to understand the behaviour I am seeing, so please help me understand it.

(step 1) I read a Hive table as a DataFrame and perform a count as an action. Let's say the count is 2.
(step 2) I cache this DataFrame.
(step 3) I add 2 additional records to the Hive table.
(step 4) I count the cached DataFrame again.

Since I cached the DataFrame in step 2, I expect the counts in step 1 and step 4 to both be 2. This is what happens when I add the additional records to the table from outside the Spark application.
However, it is not the same when I do step 3 from within the application, and I do not understand why.

If I do step 3 from the same application, the step 4 count is 4. But why? Since I cached the DataFrame, shouldn't step 4 look at the cached data rather than going to the source (in this case the Hive table) again?


Re: Spark DataFrame Cache

Master Collaborator

To repeat and expand the answer from before, here's what happens:



  • 2 records exist
  • DataFrame is counted; result is 2
  • DataFrame is marked for caching and is not computed
  • 2 records are added; 4 records exist
  • DataFrame is counted
  • DataFrame is computed and cached, with 4 records
  • count results in 4

In general, remember also that a cached entity can be recomputed at any time. If the underlying data changes, your results will change even for 'cached' results.

Re: Spark DataFrame Cache




My understanding was: I query the table and cache the DataFrame, so every time I refer to this DataFrame it will not go beyond this point (it will not go back to the source), because I have cached it.

But I was wrong; as you said, a cached entity can be recomputed.


So, if I have 2 records in the table when I start my application, how can I restrict my application to use only these two records throughout its run, even if the records in the original Hive table are changed while the application is running?

[My understanding (which is wrong) was that I could do this by caching the DataFrame.]


Please forgive my lack of understanding.

Re: Spark DataFrame Cache

Master Collaborator

Calling cache() does not cause a DataFrame to be computed; evaluation is lazy in Spark. You can call an action on it before adding the 2 records. Then it will be computed and cached in the state where it has 2 records.


But you can still get a count of 4 later if the DataFrame is recomputed (for example, if its cached partitions were evicted).


In general, operations assume immutable data, and you're mutating your data.

Re: Spark DataFrame Cache


Yes, I realised I had missed this part in my reply right after I posted it.


1. DataFrame is marked for caching.

2. DataFrame is computed by the count action; the count is 2.

3. Two records are inserted.

4. The cached DataFrame is recomputed and the count is 4.


I think I am clear on this behaviour. But now the question is: how can we restrict our application to use only these two records throughout the application, even if the records in the original Hive table are changed while the application is running?

How do we prevent our application from going back to the source (recomputing) again and again?