Created 07-31-2017 12:20 PM
Hi,
I have a requirement to initiate another map-reduce job (using Hive on the same cluster) using the information from each RDD.
I need to know whether I can run a JVM (executing that Hive map-reduce job) inside a Spark job while processing each RDD. If this is possible, what is the procedure to achieve it? Any sample code or documentation would be helpful.
I use the HDP sandbox with:
Hadoop version: 2.7.3.2.6.0.3-8
Spark version: 2.1.0.2.6.0.3-8
Hive version: 2.1.0.2.6.0.3-8
Thanks.
Created 08-02-2017 04:53 AM
Hi Opao,
I'm not sure I follow your thinking here, so let me restate the problem: you have a Spark program (written in Scala? Java?) that needs to take data from an RDD, use it as input to a query against Hive (so not really map-reduce?) on the same cluster, and then use the response of that query in your Spark program, yes? Why the need to spawn a new JVM? It seems like you could use SparkSQL, spawn a Hive context, and execute the query inline. Could you elaborate?
Bob
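P.S. A rough, untested sketch of what I mean by running the query inline, in Scala. Note that in Spark 2.x the old HiveContext is superseded by a SparkSession with Hive support enabled, and the database/table/column names below are just placeholders:

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession that talks to the cluster's Hive metastore.
val spark = SparkSession.builder()
  .appName("inline-hive-query")
  .enableHiveSupport()
  .getOrCreate()

// Run the Hive query inline; the result comes back as a DataFrame that you can
// join or filter against your other RDDs/DataFrames in the same program.
val result = spark.sql("SELECT id, value FROM some_db.some_table WHERE value > 100")
result.show()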
Created 08-02-2017 10:13 AM
Hi Bob,
The data in the RDD needs to go through the Hive Streaming API (which internally uses map-reduce), and that API uses the following classes:
org.apache.hive.hcatalog.streaming.{HiveEndPoint, StreamingConnection, StrictJsonWriter, TransactionBatch}
So, is it possible to invoke this from a spawned Hive context instead of launching it as a separate JVM?
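To illustrate, this is roughly the kind of code I would like to run per partition of the RDD (an untested sketch: the metastore URI, database, table, and partition values are placeholders, rdd is assumed to be an RDD[String] of JSON records matching the table schema, and the hive-hcatalog-streaming jar would need to be on the executor classpath):

import java.util.Arrays
import org.apache.hive.hcatalog.streaming.{HiveEndPoint, StreamingConnection, StrictJsonWriter, TransactionBatch}

rdd.foreachPartition { records =>
  // One streaming connection per partition, opened on the executor.
  val endPoint = new HiveEndPoint("thrift://sandbox.hortonworks.com:9083",
                                  "some_db", "some_acid_table", Arrays.asList("2017-08-02"))
  val connection: StreamingConnection = endPoint.newConnection(true) // create partition if missing
  val writer = new StrictJsonWriter(endPoint)
  val txnBatch: TransactionBatch = connection.fetchTransactionBatch(10, writer)
  try {
    // Write all records of this partition in one transaction of the batch.
    txnBatch.beginNextTransaction()
    records.foreach { json => txnBatch.write(json.getBytes("UTF-8")) }
    txnBatch.commit()
  } finally {
    txnBatch.close()
    connection.close()
  }
}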
Thanks.