Share 1 RDD between 2 Spark applications (memory persistence)

Contributor

Hi everyone!

 

I would like to write 2 small Spark applications:

  • Application A reads a file, transforms the data, and stores the transformed data as an RDD in memory
  • Application B retrieves the in-memory RDD and performs some validation across its data

Is this possible? I can't figure out how to do it or which functionality should be used, as I don't want the RDD to be persisted on disk... A rough sketch of what I have in mind is below.
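To make the question concrete, here is a minimal sketch of what Application A does on its own (the path and the transformation are hypothetical placeholders); the open question is how a separate Application B, with its own SparkContext, could get at that cached RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application A: read a file, transform it, keep the result cached in memory.
object AppA {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AppA"))
    val transformed = sc.textFile("/data/input.txt")   // hypothetical input path
      .map(line => line.trim.toUpperCase)              // placeholder transformation
      .cache()                                         // cached only in AppA's executors
    transformed.count()                                // materialize the cache
    // Once AppA's SparkContext stops, the cached RDD disappears; a separate
    // Application B cannot see another application's in-memory blocks.
  }
}
```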

 

Thanks for your tips,

Greg.

1 ACCEPTED SOLUTION

Contributor

I made a choice: Spark-JobServer. This project seems to have been built almost exactly for my needs: it allows sharing RDDs between applications because they share a context. It supports Spark SQL/Hive contexts. And it works without needing to install a new component on every cluster node 🙂


7 REPLIES

Contributor

Still digging...

 

I tried using the Hive metastore to persist a table: it would be filled by the first application and read by the second one. The problem is that the table is physically stored, that is, on disk and not in memory.
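For context, a rough sketch of what I tried, assuming Spark with Hive support (the path, table and column names are hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// Application A: transform, then save as a Hive table registered in the metastore.
val hiveContext = new HiveContext(sc)   // sc: the existing SparkContext
import hiveContext.implicits._

val transformed = sc.textFile("/data/input.txt")        // hypothetical path
  .map(_.split(","))
  .map(fields => (fields(0), fields(1)))
  .toDF("key", "value")

transformed.write.saveAsTable("transformed_data")       // stored on disk via the metastore

// Application B: read the same table back through the metastore and validate it.
val forValidation = hiveContext.table("transformed_data")
val invalidCount = forValidation.filter("value is null").count()
```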

 

Since both my applications work on Spark RDDs and keep data in memory, I would like to avoid using disk to store a persisted RDD, especially when my 2 applications are called sequentially in a batch (or by a main application), so that no time is lost.

Master Collaborator

An RDD is bound to an application, so it can't be shared across apps. You simply persist the data (e.g. on HDFS) and read it from the other app as an RDD.

 

I know people think that is slow, or slower than somehow sharing an RDD, but it isn't if you consider what's necessary to maintain fault tolerance across apps: you'd still be persisting something somewhere besides memory. And HDFS caching can make much of the reading from HDFS an in-memory operation anyway.
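A minimal sketch of that pattern (the paths and transformation are placeholders); object files keep the element type without needing a custom format:

```scala
// Application A: transform, then persist the result to HDFS.
val transformed = sc.textFile("hdfs:///data/input.txt")    // hypothetical path
  .map(line => line.trim.toUpperCase)                      // placeholder transformation
transformed.saveAsObjectFile("hdfs:///data/transformed")   // serialized objects on HDFS

// Application B (possibly much later): read it back as an RDD and validate it.
val restored = sc.objectFile[String]("hdfs:///data/transformed")
val invalidCount = restored.filter(_.isEmpty).count()      // placeholder validation
```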

Contributor

Thanks for this well-argued explanation. As you might guess, I'm a beginner in this world and these explanations reassure me a lot :).

 

I can have 2 behaviors for my 2 applications:

- App A loads data from HDFS and transforms it. Some time later (maybe hours, maybe days), App B loads these data and validates them.

- App A loads data from HDFS and transforms it. Immediately afterwards, App B loads these data and validates them.

So, for the second behavior, I believe that, on paper, it would not be optimal to store the data in HDFS and read it again immediately, even if this is not so slow, even if it maintains fault tolerance, and so on.

Anyway, I will try HDFS persistence and check the performance. I might not have the skills needed to fine-tune Hadoop memory caching, but with default settings it may be good enough for my needs. Again, thanks for your explanations on this lead.

 

Maybe I should find out whether a Spark application can call another one. In that case, I could write a third (master) application that would launch applications A and B sequentially and manage the RDD at the master level, passing it to applications A and B so that they update it... But this design bothers me, I don't like the idea, and I don't know whether it is feasible.

 

Another lead I found, and I believe it is the one to follow if I choose not to persist the RDD on HDFS, is the experimental OFF_HEAP persistence using Tachyon. I will try to play a bit with Tachyon and see whether something can be done this way, without relying on Spark's OFF_HEAP RDD persistence, as it is an experimental feature.
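For reference, a minimal sketch of that experimental storage level, assuming a Spark 1.x setup where OFF_HEAP blocks are backed by Tachyon (the path and transformation are placeholders, and the Tachyon store URL still has to be configured on the SparkConf, with a property name that depends on the Spark version):

```scala
import org.apache.spark.storage.StorageLevel

val transformed = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
  .map(line => line.trim.toUpperCase)                     // placeholder transformation

transformed.persist(StorageLevel.OFF_HEAP)   // experimental: cache blocks off-heap in Tachyon
transformed.count()                          // materialize the off-heap cache

// The cached blocks are still tied to this application's SparkContext lifetime,
// so on its own this does not make the RDD visible to a second application.
```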

 

Greg.

 

Master Collaborator

It sounds like you want one process, not two, then, if the two phases are that tightly coupled.

Also consider using a message queue like Kafka with Spark Streaming, so that the output of one job is processed by another in near real time.

I would not over-complicate it.
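If you go the streaming route, a rough sketch of the consuming side (the broker, topic and validation are hypothetical, and the exact KafkaUtils API depends on your Spark and Kafka versions):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Job B as a streaming consumer: validate records that job A publishes to Kafka.
val conf = new SparkConf().setAppName("ValidationStream")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
val topics = Set("transformed-data")                              // hypothetical topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2)          // keep the message value
  .filter(_.nonEmpty)     // placeholder validation
  .print()

ssc.start()
ssc.awaitTermination()
```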

 

Tachyon is also an option, but as far as I know it's not necessarily finished or completely integrated with Spark. I don't know if it will be.

Contributor

I need to be able to choose between 1 process and 2, which is a bit weird 😉

Thanks for the leads, I will have a look!

Contributor

Hello there 🙂

 

I tested HDFS performance and I admit it may be sufficient for my needs, thanks for the lead!

 

Moreover, as the Kafka and Tachyon integrations are still experimental, and as this is rather heavyweight, I looked for something else and found the spark-jobserver project, which may be exactly what I need: a server Spark application opens the SparkContext and manages RDDs on behalf of client Spark jobs. It may do the job, I'll look at it.

https://github.com/spark-jobserver/spark-jobserver
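For reference, a rough sketch of how two jobs could share an RDD through spark-jobserver's named-RDD support, based on the project's README (trait and method names may differ between versions; the RDD name, path and transformation are hypothetical):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Job A: build the RDD once and register it under a name in the shared context.
object TransformJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val transformed = sc.textFile("/data/input.txt")      // hypothetical path
      .map(_.trim.toUpperCase)                            // placeholder transformation
    namedRdds.update("transformed", transformed.cache())  // cached in the long-lived context
    transformed.count()
  }
}

// Job B: submitted later to the same context, it retrieves the cached RDD by name.
object ValidateJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val transformed = namedRdds.get[String]("transformed").get
    transformed.filter(_.isEmpty).count()                 // placeholder validation
  }
}
```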
