I am having some issue with caching dataframe in spark. I think I am missing some peice to understand the behaviour I am seeing.
So please help me understand this.
(step 1). I am reading hive table as a dataframe and performing count as an action. Lets say we have the count 2.
(step 2). I am caching this dataframe.
(step 3). I am adding 2 additional records to the hive table.
(step 4). I am doing count on the cached dataframe again.
Since I cached the dataframe in step 2, I am expecting, the count in step 1 and step 4 should be 2. This is working when I am adding additional records to the table from outside the spark application.
However it is not the same if I am doing step 3 from within the application. I AM NOT UNDERSTANIDNG WHY.
If I do step 3 from the same application I am getting step 4 count as 4. But why??
Since I cached the dataframe, should not step 4 look into that dataframe rather than going the source again (in this case hive table)
To repeat and expand the answer from before, here's what happens:
In general, remember also that a cached entity can be recomputed at any time. If the underlying data changes your results will change even for 'cached' results.
My understanding was. I am quering the table and caching the dataframe. So now every time when I refer to this dataframe it will not go beyond this point ( it will not go to the source), because I have cached that dataframe.
But I am wrong, as you said cached entity can be recomputed.
So. If I have 2 records in the table when I am starting my application. How can i restrict my application to use only these two records through out my application, even if we are changing the records on the original hive table while the application is running.
[ my understanding (which is wrong) was I can do this by caching the dataframe]
please forgive me for my lack of understanding
Calling cache() does not cause a DataFrame to be computed. Evaluation is lazy in Spark. You can call an action on it before adding the 2 records. Then it will be computed and cached in the state that it has 2 records.
But you can still get a count of 4 later if the DataFrame were recomputed (like if its cached partitions were evicted).
In general, operations assume immutable data, and you're mutating your data.
Yes I realised I missed this part in my reply right after I posted.
1. Dataframe is marked for cache
2. Dataframe is computed with count action. count is 2
3. two records are inserted
3. cached dataframe is recomputed and the count is 4.
I think I am clear on this behaviour. But now the question comes, how can we restrict our application to use only these two records through our my application, even if we are changing the records on the original hive table while the application is running.
How do we prevent our application to go (or to recompute ) to the source again and again.