Created on 08-14-2015 07:21 AM - edited 09-16-2022 02:37 AM
Hi everyone!
I would like to write 2 small Spark applications:
Is this possible? I can't figure out how to do that or what functionality should be used, as I don't want the RDD to be persisted on disk...
Thanks for your tips,
Greg.
Created 08-18-2015 06:44 AM
Still digging...
I tried to use the Hive metastore to persist a table. It would be filled by the first application and read by the second one. The problem is that the table is physically stored, that is to say on disk and not in memory.
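For reference, here is roughly what I tried, as a minimal sketch assuming a Spark 1.x HiveContext; the paths and table name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Application A: transform the input and save the result as a Hive table.
object TransformApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TransformApp"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val transformed = sc.textFile("hdfs:///data/input")   // hypothetical input path
      .map(line => (line, line.length))                   // placeholder transformation
      .toDF("value", "length")
    transformed.write.saveAsTable("staging_transformed")  // hypothetical table name
    sc.stop()
  }
}

// Application B: a separate driver that reads the table back through the metastore.
object ValidateApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ValidateApp"))
    val hiveContext = new HiveContext(sc)
    val df = hiveContext.table("staging_transformed")
    println(s"rows to validate: ${df.count()}")           // placeholder validation
    sc.stop()
  }
}
```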
Since both my applications work with Spark RDDs and store data in memory, I would like to avoid using the disk to persist an RDD when my 2 applications are called sequentially in a batch (or by a main application), so that no time is lost.
Created 08-18-2015 06:53 AM
An RDD is bound to an application, so it can't be shared across apps. You simply persist the data (e.g. on HDFS) and read it from the other app as an RDD.
I know people think that is slow, or slower than sharing an RDD somehow, but it isn't if you think about what's necessary to maintain fault tolerance across apps. You'd still be persisting something somewhere besides memory. And HDFS caching can make a lot of the reading from HDFS an in-memory operation anyway.
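Concretely, the pattern is just something like the following sketch (paths, types, and the transformation are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application A: compute an RDD and persist it to HDFS as the handoff point.
object AppA {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AppA"))
    val result = sc.textFile("hdfs:///data/input").map(_.trim)  // placeholder transformation
    result.saveAsObjectFile("hdfs:///tmp/shared-rdd")            // hypothetical handoff path
    sc.stop()
  }
}

// Application B: a completely separate driver that reads the same files back as an RDD.
object AppB {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AppB"))
    val reloaded = sc.objectFile[String]("hdfs:///tmp/shared-rdd")
    println(s"non-empty records: ${reloaded.filter(_.nonEmpty).count()}")  // placeholder validation
    sc.stop()
  }
}
```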
Created 08-18-2015 07:17 AM
Thanks for this well-argued explanation. As you might suppose, I'm a beginner in this world and these explanations reassure me a lot :).
I can have 2 behaviors for my 2 applications:
- App A loads data from HDFS and transforms it. A moment later (maybe hours, maybe days), App B loads this data and validates it.
- App A loads data from HDFS and transforms it. Immediately after, App B loads this data and validates it.
So, for the second behavior, I believe that, on paper, it would not be optimal to store data in HDFS and read it again immediately, even if this is not so slow, even if it maintains fault tolerance, etc.
Anyway, I will try HDFS persistence and check performance. I might not have the skills needed to fine-tune HDFS caching but, with default settings, it may be good enough for my needs. Again, thanks for your explanations on this lead.
Maybe I should look into whether a Spark application can call another one. In that case, I could write a third application (a master one) that would launch applications A and B sequentially and manage the RDD at the master level, passing it to applications A and B so that they update it... But this design bothers me, I don't like the idea. And I don't know whether it is feasible.
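Just to make the idea concrete to myself, here is a rough sketch of that design, with hypothetical function and path names: applications A and B become plain functions called by the master, which keeps the RDD in memory between the two phases.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch of the "master application" idea: A and B become functions that share
// the same SparkContext and RDD, so nothing has to be written to disk in between.
object MasterApp {
  // Former application A (hypothetical transformation).
  def transform(sc: SparkContext): RDD[String] =
    sc.textFile("hdfs:///data/input").map(_.trim)

  // Former application B (hypothetical validation).
  def validate(data: RDD[String]): Long =
    data.filter(_.nonEmpty).count()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MasterApp"))
    val data = transform(sc).cache()   // kept in memory between the two phases
    println(s"valid records: ${validate(data)}")
    sc.stop()
  }
}
```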
Another lead I found, and I believe this is the one to follow if I choose not to persist the RDD on HDFS, is the experimental OFF_HEAP persistence using Tachyon. I will try to play a bit with Tachyon and see if something can be done this way, without relying on Spark's OFF_HEAP RDD persistence, as it is an experimental feature.
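For reference, the experimental API itself would look roughly like this sketch; the configuration key and Tachyon URL below are assumptions I would still have to verify for my Spark version:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch of the experimental OFF_HEAP persistence backed by Tachyon (Spark 1.x era).
// The configuration key and the Tachyon master URL are assumptions to verify.
object OffHeapTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("OffHeapTest")
      .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")  // assumed key and URL

    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs:///data/input").map(_.trim)  // placeholder transformation
    data.persist(StorageLevel.OFF_HEAP)  // blocks go to Tachyon instead of the executor heap
    println(data.count())
    sc.stop()
  }
}
```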
Greg.
Created 08-18-2015 07:30 AM
It sounds like you want to have one process, not two then, if the two phases are so tied together.
Also consider using a message queue like Kafka and Spark Streaming to process the output of one separate job in another in near real time.
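Roughly, the consuming side of that pattern could look like this sketch, assuming the Spark 1.x direct Kafka stream API; the broker, topic, and validation logic are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: one job publishes its output to a Kafka topic, and this streaming job
// consumes and validates it in near real time. Names below are placeholders.
object ValidatorStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ValidatorStream"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("transformed-records")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2)                      // keep the message value
      .filter(_.nonEmpty)                 // placeholder validation
      .foreachRDD(rdd => println(s"valid records in this batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```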
I would not over-complicate it.
Tachyon is also an option but as far as I know it's not necessarily finished or completely integrated with Spark. I don't know if it will be.
Created 08-18-2015 07:34 AM
I need to be able to choose between having 1 process or 2, weird as that sounds 😉
Thanks for the leads, I will have a look!
Created 08-20-2015 08:15 AM
Hello there 🙂
I tested HDFS performance and I admit it may be sufficient for my needs, thanks for the lead!
Moreover, as the Kafka and Tachyon integrations are still experimental and would be quite a big undertaking, I searched for something else and found the spark-jobserver project, which may be exactly what I need: a server Spark application opens the SparkContext and manages RDDs for client Spark applications. It may do the trick, I'll look at it.
Created 09-01-2015 12:19 AM
I made a choice: Spark-JobServer. This project was built almost exactly in response to my needs: it allows sharing RDDs between applications because it shares a context. It supports Spark SQL/Hive contexts. And it works fully without the need to install a new component on all cluster nodes 🙂
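To give an idea of how it works, here is a rough sketch based on my reading of the project's documentation; the exact trait and method names should be checked against the spark-jobserver version used. One job registers its result as a named RDD in the shared context, and a later job picks it up again without touching disk.

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Job A: runs inside the shared context and registers its result as a named RDD.
object TransformJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val transformed = sc.textFile("hdfs:///data/input").map(_.trim)  // placeholder transformation
    namedRdds.update("transformed", transformed)                     // cached in the shared context
    transformed.count()
  }
}

// Job B: submitted later to the same context, it retrieves the RDD by name.
object ValidateJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val transformed = namedRdds.get[String]("transformed").get
    transformed.filter(_.nonEmpty).count()                           // placeholder validation
  }
}
```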