How to run another JVM within a SPARK job


Hi,

I have a requirement to initiate another MapReduce job (using Hive on the same cluster) using the data from each RDD.

I need to know whether I can run a separate JVM (executing that Hive MapReduce job) inside a Spark job while processing each RDD. If this is possible, what is the procedure to achieve it? Any sample code or documentation would be helpful.

I use the HDP sandbox with:

Hadoop version: 2.7.3.2.6.0.3-8

Spark version: 2.1.0.2.6.0.3-8

Hive version: 2.1.0.2.6.0.3-8

Thanks.


Re: How to run another JVM within a SPARK job


Hi Opao,

I'm not sure I follow your thinking here, so let me restate the problem: you have a Spark program (written in Scala? Java?) that needs to take data from an RDD, use it as input to a query against Hive (so not really MapReduce?) on the same cluster, and then use the query's result in your Spark program, yes? Why the need to spawn a new JVM? It seems like you could use Spark SQL: create a Hive-enabled session and execute the query inline, along the lines of the sketch below. Could you elaborate?

Bob
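A minimal sketch of the inline approach, assuming Spark 2.x with Hive support configured on the cluster; the database, table, and query below are placeholders:

import org.apache.spark.sql.SparkSession

// Build a Hive-enabled SparkSession so Hive tables can be queried in-process
val spark = SparkSession.builder()
  .appName("inline-hive-query")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder database/table/query -- runs inside the same Spark application, no extra JVM
val result = spark.sql("SELECT id, name FROM my_db.my_table WHERE ds = '2019-01-01'")
result.show()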


Re: How to run another JVM within a SPARK job


Hi Bob,

The data in the RDD needs to go through the hive-streaming API (which internally uses MapReduce), and that API uses the following classes:

org.apache.hive.hcatalog.streaming.{HiveEndPoint, StreamingConnection, StrictJsonWriter, TransactionBatch}

So, can this be invoked from a spawned Hive context instead of as a separate JVM?
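To make the intent concrete, here is a rough sketch of what I would like to do from inside foreachPartition; the metastore URI, database, table, partition value, and the assumption that the RDD holds JSON strings are all placeholders:

import java.util.Arrays
import org.apache.hive.hcatalog.streaming.{HiveEndPoint, StreamingConnection, StrictJsonWriter, TransactionBatch}

// jsonRdd: RDD[String] of JSON records -- an assumption for this sketch
jsonRdd.foreachPartition { records =>
  // Placeholder metastore URI, database, table, and partition value
  val endPoint = new HiveEndPoint("thrift://sandbox-hdp:9083", "my_db", "my_acid_table",
                                  Arrays.asList("2019-01-01"))
  // The target table must be a transactional (ACID) table for hive-streaming writes
  val connection: StreamingConnection = endPoint.newConnection(true) // create partition if missing
  val writer = new StrictJsonWriter(endPoint)

  // Write this partition's records in one transaction from a small transaction batch
  val txnBatch: TransactionBatch = connection.fetchTransactionBatch(10, writer)
  txnBatch.beginNextTransaction()
  records.foreach(record => txnBatch.write(record.getBytes("UTF-8")))
  txnBatch.commit()
  txnBatch.close()
  connection.close()
}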

Thanks.
