
How to run a Spark job from an Oozie workflow on HDP/Hue

Expert Contributor

I have created a small Java program for Spark. It works with the "spark-submit" command, and I would like to run it from an Oozie workflow. HDP 2.3 seems to have the capability to run a Spark job from an Oozie workflow, but in Hue's GUI I don't have a choice of a Spark job to include in a workflow. How can I do this?

1 ACCEPTED SOLUTION

Expert Contributor

I figured it out myself. Here are the steps:

1: download the HDP 2.3.2 sandbox, or use your existing sandbox

2: create a workflow in Hue's Oozie editor

3: click "Edit Properties" and add a property to the Oozie parameters: oozie.action.sharelib.for.spark = spark,hcatalog,hive

4: click the Save button

5: add a shell action and fill in the name field. The shell command field may be required; enter any placeholder string for now and save the shell action. We come back to edit it later.

6: close the workflow and open the file browser; click oozie, then workspaces. Identify the _hue_xxx directory for the workflow you are creating.

7: create a lib directory there.

8: copy in your JAR file that contains the Spark Java program.

9: move up one directory and copy in a shell file (e.g. script.sh) that contains:

spark-submit --class JDBCTest spark-test-1.0.jar

spark-test-1.0.jar is the JAR file you uploaded to the lib directory.

10: go back to the workflow web page

11: open the shell action and set the Shell command field by selecting the shell file (e.g. script.sh)

12: also populate the Files field with the same shell file (e.g. script.sh)

13: click Done

14: save the workflow

15: submit the workflow

16: it should run. (A sketch of the workflow these steps produce is shown after this list.)
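For reference, these steps amount to a workflow with a single shell action that runs script.sh; the oozie.action.sharelib.for.spark property from step 3 tells Oozie which sharelibs to put on the action's classpath. A rough sketch of the kind of workflow.xml Hue generates is below; the action and workflow names are made up, and the exact elements and paths will vary by Hue version:

<workflow-app name="spark-shell-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="spark-shell"/>
    <action name="spark-shell">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>script.sh</exec>
            <file>script.sh#script.sh</file>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Shell action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

The exec element corresponds to the Shell command field (step 11) and the file element to the Files field (step 12), which is why the shell file has to be entered in both places.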

My Java program does something like this:

Statement stmt = con.createStatement();

String sql = "SELECT s07.description AS job_category, s07.salary, s08.salary, (s08.salary - s07.salary) AS salary_difference FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) WHERE s07.salary < s08.salary SORT BY s08.salary - s07.salary DESC LIMIT 5";

ResultSet res = stmt.executeQuery(sql);

It uses the Hive JDBC driver.
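For completeness, here is a minimal sketch of such a program, assuming HiveServer2 is reachable at localhost:10000 and the Hue sample tables are loaded; the connection URL, credentials, and the class name JDBCTest are assumptions to adjust for your cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JDBCTest {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (hive-jdbc must be on the classpath;
        // the Oozie sharelib property from step 3 provides it at runtime).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to HiveServer2; host, port, and user are assumptions.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        String sql = "SELECT s07.description AS job_category, s07.salary, s08.salary, "
                + "(s08.salary - s07.salary) AS salary_difference "
                + "FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) "
                + "WHERE s07.salary < s08.salary "
                + "SORT BY s08.salary - s07.salary DESC LIMIT 5";
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getInt(2)
                    + "\t" + res.getInt(3) + "\t" + res.getInt(4));
        }
        con.close();
    }
}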


13 REPLIES

Explorer

For your review @Artem Ervits

https://na9.salesforce.com/articles/en_US/How_To/How-to-run-Spark-Action-in-oozie-of-HDP-2-3-0?popup...

While I was not able to see the contents of this link directly (I had to have somebody extract it for me), perhaps you can as a Hortonworks insider.

Master Mentor

Awesome! I will try this out and maybe publish an article @cmuchinsky

Expert Contributor

I also tested HiveContext, so that the Hive processing runs in Spark's memory. It works.
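A minimal sketch of what such a test might look like with the Spark 1.x API that ships with HDP 2.3, again using the Hue sample tables; the class name HiveContextTest and the query are illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

public class HiveContextTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HiveContextTest");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // HiveContext reads table metadata from the Hive metastore and then
        // executes the query as Spark jobs instead of going through HiveServer2.
        HiveContext hiveContext = new HiveContext(sc.sc());
        DataFrame result = hiveContext.sql(
                "SELECT s07.description, s08.salary - s07.salary AS diff "
                + "FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) "
                + "WHERE s07.salary < s08.salary ORDER BY diff DESC LIMIT 5");
        for (Row row : result.collect()) {
            System.out.println(row);
        }
        sc.stop();
    }
}

This can be packaged into the same JAR and launched from the same shell action, with spark-submit pointed at the new class.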