08-14-2015 07:21 AM - edited 08-14-2015 07:34 AM
Hi everyone!
I would like to write 2 small Spark applications:
Is this possible? I can't figure out how to do that or what functionality should be used, as I don't want the RDD to be persisted on disk...
Thanks for your tips,
08-18-2015 06:44 AM
I tried to use the Hive MetaStore to persist a table. It would have been filled by the first application and read by the second one. The problem is that the table is physically stored, that is to say on disk and not in memory.
Since both my applications work with Spark RDDs and keep data in memory, I would rather not use disk to store a persistent RDD when my 2 applications are called sequentially in a batch (or by a main application), so that no time is lost.
08-18-2015 06:53 AM
An RDD is bound to an application, so it can't be shared across apps. You simply persist the data (e.g. on HDFS) and read it from the other app as an RDD.
I know people think that is slow, or slower than sharing an RDD somehow, but it isn't if you think about what's necessary to maintain fault tolerance across apps. You'd still be persisting something somewhere besides memory. And HDFS caching can make a lot of the reading from HDFS an in-memory operation anyway.
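A minimal sketch of that hand-off in PySpark, in case it helps. The path, the app names and the `transform` / `is_valid` rules are hypothetical placeholders, not anything from your setup; the only Spark calls used are `textFile`, `map`, `filter`, `count` and `saveAsTextFile`.

```python
# Hypothetical hand-off location: any HDFS path both apps can reach would do.
HANDOFF_PATH = "hdfs:///tmp/handoff/transformed"


def transform(line):
    """Placeholder transformation: trim whitespace and upper-case the record."""
    return line.strip().upper()


def is_valid(record):
    """Placeholder validation rule: a record is valid if it is not empty."""
    return len(record) > 0


def run_app_a(input_path, output_path=HANDOFF_PATH):
    """App A: load from HDFS, transform, and persist the result back to HDFS."""
    from pyspark import SparkContext  # imported here so the helpers work without Spark

    sc = SparkContext(appName="app-a-transform")
    try:
        # saveAsTextFile fails if output_path already exists, so the caller
        # has to pick a fresh directory (or remove the old one) for each run.
        sc.textFile(input_path).map(transform).saveAsTextFile(output_path)
    finally:
        sc.stop()


def run_app_b(input_path=HANDOFF_PATH):
    """App B: read what App A wrote as a new RDD and count the invalid records."""
    from pyspark import SparkContext

    sc = SparkContext(appName="app-b-validate")
    try:
        records = sc.textFile(input_path)
        return records.filter(lambda r: not is_valid(r)).count()
    finally:
        sc.stop()
```

Each function would be submitted as its own application, so App B can run minutes or days after App A without caring whether App A is still alive.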
08-18-2015 07:17 AM
Thanks for this well-argued explanation. As you might suppose, I'm a beginner in this world and these explanations reassure me a lot :).
I can have 2 behaviors for my 2 applications:
- App A loads data from HDFS and transforms it. A moment later (maybe hours, maybe days), App B loads this data and validates it.
- App A loads data from HDFS and transforms it. Immediately after, App B loads this data and validates it.
So, for the second behavior, I believe that, on paper, it would not be optimal to store data in HDFS and read it again immediately, even if this is not so slow, even if it maintains fault tolerance, etc.
Anyway, I will try to use HDFS persistence and check performance. I might not have the necessary skills to fine-tune Hadoop memory caching but, with default settings, it may be good enough for my needs. Again, thanks for your explanations on this lead.
Maybe I should look into whether a Spark application can call another one. In that case, I could write a third application (a master one) that would launch applications A and B sequentially and manage the RDD at the master level, passing it to applications A and B so that they update it... But this design bothers me, I don't like this idea. And I don't know if this is feasible.
Another lead I found, and I believe this is the one to follow if I choose not to persist the RDD on HDFS, is the experimental OFF_HEAP persistence using Tachyon. I will try to play a bit with Tachyon and see if something can be done this way, without relying on Spark RDD OFF_HEAP persistence as this is an experimental feature.
08-18-2015 07:30 AM
It sounds like you want to have one process, not two then, if the two phases are so tied together.
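As a rough sketch of the one-process version, assuming PySpark: both phases run on the same SparkContext and the transformed RDD is kept in memory between them. The CSV parsing and the three-field validation rule are made-up placeholders.

```python
def transform(line):
    """Placeholder transformation: split a CSV line into trimmed fields."""
    return [field.strip() for field in line.split(",")]


def is_valid(fields, expected=3):
    """Placeholder validation: the expected number of fields, none of them empty."""
    return len(fields) == expected and all(fields)


def run_pipeline(input_path):
    """Transform and validate in a single application, sharing one cached RDD."""
    from pyspark import SparkContext, StorageLevel  # lazy import: helpers work without Spark

    sc = SparkContext(appName="transform-then-validate")
    try:
        transformed = sc.textFile(input_path).map(transform)
        # Keep the transformed partitions in executor memory so the
        # validation phase does not recompute or re-read them.
        transformed.persist(StorageLevel.MEMORY_ONLY)
        total = transformed.count()  # first action: materializes the cache
        invalid = transformed.filter(lambda f: not is_valid(f)).count()
        return total, invalid
    finally:
        sc.stop()
```

The cached RDD disappears when the application stops, which is fine here because nothing outside this process needs it.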
Also consider using a message queue like Kafka and Spark Streaming to process the output of one separate job in another in near-real-time.
I would not over-complicate it.
Tachyon is also an option, but as far as I know it's not necessarily finished or completely integrated with Spark. I don't know if it will be.
08-20-2015 08:15 AM
Hello there :)
I tested HDFS performance and I admit it may be sufficient for my needs, thanks for the lead!
Moreover, as the Kafka and Tachyon integrations are still experimental and as this is some big stuff, I searched for something else and found the spark-jobserver project, which may be exactly what I need: a server Spark application opens the SparkContext and manages RDDs for client Spark applications. It may do the job, I'll look into it.
09-01-2015 12:19 AM
I made a choice: Spark-JobServer. This project matches my needs almost exactly: it allows sharing RDDs between applications because they share a context. It supports Spark SQL/Hive contexts. And it is fully working without the need to install a new component on all cluster nodes :)
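For anyone landing here later, a rough sketch of how the two jobs can be pointed at the same long-lived context through the job server's REST interface. This is written from memory of the project README, so treat the endpoint shapes (`/contexts/<name>`, `/jobs?appName=...&classPath=...&context=...`) and the port 8090 as assumptions to check against the version you install. The app name, class names and context name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed job server address; adjust to your deployment.
JOBSERVER = "http://localhost:8090"


def context_url(name, **params):
    """URL used to create a named, long-lived context on the job server."""
    query = urlencode(params)
    return f"{JOBSERVER}/contexts/{name}" + (f"?{query}" if query else "")


def job_url(app_name, class_path, context, sync=True):
    """URL used to submit a job that runs inside an existing named context."""
    query = urlencode({
        "appName": app_name,
        "classPath": class_path,
        "context": context,
        "sync": str(sync).lower(),
    })
    return f"{JOBSERVER}/jobs?{query}"


def post(url, body=b""):
    """Send a POST request and return the raw response body."""
    with urlopen(Request(url, data=body, method="POST")) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical names: one shared context, then jobs A and B in sequence.
    post(context_url("shared-ctx"))
    post(job_url("my-app", "com.example.TransformJob", "shared-ctx"))
    post(job_url("my-app", "com.example.ValidateJob", "shared-ctx"))
```

Because both jobs run in the same context, the second one can pick up an RDD the first one cached and registered, instead of re-reading it from HDFS.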