
Accessing the data cached in Memory by different Spark jobs





I have a requirement in which I have a Master file (lookup file) and daily files. The Master file may be updated by the daily files.


I have a job A that updates records in the Master file, and a job B that is trying to access the same Master file and records.


I am loading the Master file as an RDD and caching it in memory in job A. Will job B be able to read the Master lookup RDD that job A cached in memory? Or is this possible only after the RDD is written to a file, which job B must then load into memory again?


I tried using broadcast variables in Spark, but was not able to access the variable across two Spark applications.


To summarize, I want the Master file to be loaded as an RDD, cached in memory, and shared across jobs.


Please clarify if this is possible in Spark. Thanks!






Master Collaborator

I think the answer is 'no', but the issue is not really sharing RDDs; it is that RDDs are immutable. Updating a file doesn't change an RDD that was already created from it, so you would be reloading the file anyway to get updates. However, you can always load A, cache it, and run many jobs that use A, all within one SparkContext.
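
To make the "load once, cache, run many jobs in one SparkContext" idea concrete, here is a minimal PySpark sketch. The `key,value` line format, the helper names `parse_record` and `run_jobs`, and the file path are assumptions for illustration and are not from the question; it also assumes every line contains a comma.

```python
def parse_record(line):
    """Split a 'key,value' line into a (key, value) pair (assumed format)."""
    key, value = line.split(",", 1)
    return key.strip(), value.strip()


def run_jobs(sc, master_path):
    """Load the master file once, cache it, and run two actions on it.

    `sc` is an existing SparkContext; both actions run inside it.
    """
    master = sc.textFile(master_path).map(parse_record).cache()

    # First action: reads the file and populates the cache.
    total = master.count()

    # Second action: can reuse the cached partitions instead of re-reading the file.
    missing = master.filter(lambda kv: kv[1] == "").count()
    return total, missing


# Usage sketch (needs a Spark installation; "master.csv" is a hypothetical path):
#   from pyspark import SparkContext
#   sc = SparkContext(appName="master-lookup")
#   print(run_jobs(sc, "master.csv"))
#   sc.stop()
```

The cache lives only as long as that SparkContext, so a separate Spark application would still have to load the file itself.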





If I understand correctly, RDDs cannot be shared across Spark jobs.


Could you please elaborate on whether there is any other way to share a lookup file across Spark jobs?

Master Collaborator

You can have a look at Tachyon, although I don't think it's ready for anything but experiments right now. You could perhaps create a long-running service that shares the RDD. But I think it's far easier to simply declare the RDD each time you need it in a job. Unless you reuse it many times, reloading is not going to hurt speed much at all. And if you do reuse it many times, you would want one job that does all of those tasks from that RDD anyway, for performance.



I have a lookup file in which records may be updated by jobs when values are missing.


In that case I need to lock the file or record, and in my use case many jobs will be running simultaneously, all trying to access the same lookup file.


Hence, if I have to load it into an RDD each time in a Spark job, I may run into synchronization issues.


So I will think again about how this can be done in Spark. Any suggestions are welcome.


Thanks again!

Master Collaborator

When you say synchronization issue, do you mean reading a possibly inconsistent state of the file because it is being overwritten? Yes, but that is a problem for any process reading such a file. If that's what you're doing, what you really need is some form of database to update the state of the data atomically. Otherwise I don't see any synchronization issue specific to Spark here.
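
As a rough illustration of what "update the state atomically" could look like, here is a small sketch using SQLite from the Python standard library. SQLite is only a stand-in for "some form of database"; the `lookup` table, the `fill_if_missing` helper, and the sample keys are all hypothetical. Jobs running on a cluster would need a database that every job can reach, not an in-memory one.

```python
import sqlite3


def fill_if_missing(conn, key, value):
    """Set the value for `key` only if it is currently missing.

    Returns True if this call wrote the value, False otherwise.
    """
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE lookup SET value = ? WHERE key = ? AND value IS NULL",
            (value, key),
        )
    return cur.rowcount == 1


# In-memory database with two sample records, one of them missing its value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO lookup VALUES ('id1', NULL), ('id2', 'known')")
conn.commit()

first = fill_if_missing(conn, "id1", "from-job-A")   # value was missing, so it is written
second = fill_if_missing(conn, "id1", "from-job-B")  # value already set, nothing is written
```

Because the "is it missing?" check and the write happen in a single UPDATE statement, two jobs racing to fill the same record cannot both succeed, and readers never see a half-written record.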