Created 12-21-2015 03:53 AM
I have created a small Java program for Spark. It works with the "spark-submit" command, and I'd like to run it from an Oozie workflow. It seems HDP 2.3 has the capability to run Spark jobs from an Oozie workflow, but in Hue's GUI I don't have the choice of a Spark job to include in a workflow. How do I do this?
Created 01-29-2016 08:44 PM
I figured it out by myself. Here are the steps:
1: download sandbox or use your existing sandbox (HDP 2.3.2)
2: create a workflow on Hue's oozie
3: click "Edit Properties" and add a property in Oozie parameters: oozie.action.sharelib.for.spark = spark,hcatalog,hive
4: click Save button
5: add a shell action; fill in the name field. The shell command field may be required; enter any placeholder string for now and save the shell action. We will come back to edit it later.
6: close the workflow and open the file browser; click oozie, then workspaces. Identify the _hue_xxx directory for the workflow you are creating.
7: create a lib directory there.
8: copy in your jar file that contains the Spark Java program.
9: move up one directory and copy in a shell file (e.g. script.sh) that contains:
spark-submit --class JDBCTest spark-test-1.0.jar
spark-test-1.0.jar is the file you uploaded to the lib directory.
10: Go back to workflow web page
11: open the shell action and set the Shell command field by selecting the shell file (e.g. script.sh)
12: also populate the Files field, adding the shell file (e.g. script.sh) again
13: click Done
14: save the workflow
15: submit the workflow
16: it should run.
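Under the hood, the "Edit Properties" step (3) corresponds to submitting the workflow with job configuration along the lines of the job.properties sketch below. The host names and paths are illustrative assumptions, not values copied from an actual Hue workspace:

```
nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
oozie.use.system.libpath=true
# pull the Hive/HCatalog jars into the Spark action's classpath
oozie.action.sharelib.for.spark=spark,hcatalog,hive
oozie.wf.application.path=${nameNode}/user/hue/oozie/workspaces/_hue_xxx
```

Hue fills most of this in for you; the sharelib line is the one you add by hand in step 3.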
My Java program does something like this:
Statement stmt = con.createStatement();
String sql = "SELECT s07.description AS job_category, s07.salary , s08.salary , (s08.salary - s07.salary) AS salary_difference FROM sample_07 s07 JOIN sample_08 s08 ON ( s07.code = s08.code) WHERE s07.salary < s08.salary SORT BY s08.salary-s07.salary DESC LIMIT 5";
ResultSet res = stmt.executeQuery(sql);
It uses the Hive JDBC driver.
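A fleshed-out sketch of that program is below. The class name JDBCTest matches the spark-submit line above, and the query and sample_07/sample_08 tables come from the snippet; the JDBC URL, user, and connection handling are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JDBCTest {

    // The salary-comparison query from the snippet above.
    static String buildQuery() {
        return "SELECT s07.description AS job_category, s07.salary, s08.salary, "
             + "(s08.salary - s07.salary) AS salary_difference "
             + "FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) "
             + "WHERE s07.salary < s08.salary "
             + "SORT BY s08.salary - s07.salary DESC LIMIT 5";
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No JDBC URL supplied: just print the query as a dry run.
            System.out.println(buildQuery());
            return;
        }
        // Running against Hive requires hive-jdbc on the classpath; a URL
        // might look like jdbc:hive2://sandbox.hortonworks.com:10000/default
        // (hypothetical host). User/password depend on your cluster setup.
        try (Connection con = DriverManager.getConnection(args[0], "hive", "");
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery(buildQuery())) {
            while (res.next()) {
                System.out.println(res.getString("job_category") + "\t"
                        + res.getInt("salary_difference"));
            }
        }
    }
}
```

With no arguments it only prints the SQL, which is handy for checking the jar outside the cluster before wiring it into the Oozie workflow.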
Created 12-21-2015 09:27 AM
Please refer to this: https://developer.ibm.com/hadoop/blog/2015/11/05/r...
Created 12-21-2015 07:59 PM
I'm new to the HDP/Big Data environment. I understand what is described there, but I don't know how to translate it to an HDP 2.3 environment. Also, I would like to run it from Hue's GUI Oozie workflow editor. Could you explain step by step?
Thanks a lot.
Created 12-21-2015 04:38 PM
There is a bug which requires you to manually copy the hive/hcat jars into the Spark sharelib directory in order to get this to work.
Created 01-29-2016 10:41 PM
It looks like HDP 2.3.2 already has this patch.
Created 01-31-2016 06:00 PM
@Ali Bajwa @Shigeru Takehara when you specify oozie.action.sharelib.for.spark = spark,hcatalog,hive,
it will include those libraries with the Spark action. That's the trick I learned the hard way :).
Created 02-03-2016 03:58 PM
I looked at the SparkMain class contained within the oozie-sharelib-spark-4.2.0.2.3.4.1-10.jar that comes with the Spark 1.6 TP, and it does not appear to have the fix for https://issues.apache.org/jira/browse/OOZIE-2277
Created 02-03-2016 03:59 PM
@cmuchinsky The Spark Oozie action is not supported in HDP at this moment. This is explicitly stated in our Spark User Guide.
Created 02-03-2016 04:18 PM
Understood @Artem Ervits, however your previous comment seems to indicate you have some knowledge of the Oozie 'oozie.action.sharelib.for.spark' property, so I wanted to clarify that the comment by @Shigeru Takehara indicating OOZIE-2277 was fixed doesn't seem to jibe with the HDP 2.3.4 or 2.3.4.1-TP deliverables.
While Spark via Oozie isn't officially supported, the Hortonworks support team had provided us with a procedure to update the Oozie sharelib for Spark to get it working with 2.3.4; however, that no longer seems to work with the Spark 1.6-enabled 2.3.4.1-TP version.
Created 02-03-2016 04:21 PM
@cmuchinsky I would love to see the steps engineering provided. In general, just because we don't officially support it doesn't mean it cannot be done. It just means sometimes you have to dig deeper, and with Oozie I have limited patience :).