Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

How to run Spark job from Oozie Workflow on HDP/hue

Expert Contributor

I have created a small Java program for Spark. It works with the "spark-submit" command, and I would like to run it from an Oozie workflow. HDP 2.3 appears to support running Spark jobs from Oozie workflows, but in Hue's GUI there is no Spark job option to include in a workflow. How do I do this?

1 ACCEPTED SOLUTION

Expert Contributor

I figured it out by myself. Here are the steps:

1: Download the sandbox or use your existing sandbox (HDP 2.3.2).

2: Create a workflow in Hue's Oozie editor.

3: Click "Edit Properties" and add a property in Oozie parameters: oozie.action.sharelib.for.spark = spark,hcatalog,hive

4: Click the Save button.

5: Add a shell action and fill in the name field. The shell command field may be required; enter any placeholder string for now and save the shell action. We come back to edit it later.

6: Close the workflow and open the file browser; click oozie, then workspaces. Identify the _hue_xxx directory for the workflow you are creating.

7: Create a lib directory there.

8: Copy your JAR file containing the Spark Java program into the lib directory.

9: Move up one directory and copy in a shell file (e.g. script.sh) that contains:

spark-submit --class JDBCTest spark-test-1.0.jar

spark-test-1.0.jar is the file you uploaded to the lib directory.

10: Go back to the workflow web page.

11: Open the shell action and set the Shell command by selecting the shell file (e.g. script.sh).

12: Also populate the Files field to add the shell file (e.g. script.sh) again.

13: Click Done.

14: Save the workflow.

15: Submit the workflow.

16: It should run.
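For reference, the shell action that steps 5, 11, and 12 build up corresponds roughly to a workflow.xml like the following sketch (the workflow and action names here are illustrative, not what Hue will necessarily generate in your _hue_xxx workspace; the sharelib property from step 3 lives in the job configuration rather than in this file):

```
<workflow-app name="spark-shell-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="spark-shell"/>
    <action name="spark-shell">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Step 11: the Shell command field -->
            <exec>script.sh</exec>
            <!-- Step 12: the Files field becomes a <file> element -->
            <file>script.sh#script.sh</file>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Shell action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```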

My Java program does something like this:

Statement stmt = con.createStatement();

String sql = "SELECT s07.description AS job_category, s07.salary, s08.salary, (s08.salary - s07.salary) AS salary_difference FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) WHERE s07.salary < s08.salary SORT BY s08.salary - s07.salary DESC LIMIT 5";

ResultSet res = stmt.executeQuery(sql);

It uses the Hive JDBC driver.
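Filled out with the connection boilerplate the snippet above omits, the program might look roughly like this. The JDBC URL, host, port, and credentials are assumptions for a default sandbox, and the query-building helper is split out only for illustration; the hive-jdbc driver jar must be on the classpath (e.g. inside the jar you upload to the lib directory):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JDBCTest {

    // Builds the salary-comparison query from the post; the limit is a
    // parameter only so the string can be checked without a live HiveServer2.
    static String salaryQuery(int limit) {
        return "SELECT s07.description AS job_category, s07.salary, s08.salary, "
             + "(s08.salary - s07.salary) AS salary_difference "
             + "FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) "
             + "WHERE s07.salary < s08.salary "
             + "SORT BY s08.salary - s07.salary DESC LIMIT " + limit;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details -- adjust host, port, database,
        // and credentials for your own cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://sandbox.hortonworks.com:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery(salaryQuery(5))) {
            while (res.next()) {
                // Column 1 is job_category, column 4 is salary_difference.
                System.out.println(res.getString(1) + "\t" + res.getInt(4));
            }
        }
    }
}
```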


13 REPLIES

New Member

For your review @Artem Ervits

https://na9.salesforce.com/articles/en_US/How_To/How-to-run-Spark-Action-in-oozie-of-HDP-2-3-0?popup...

While I was not able to see the contents of this link directly (I had to have somebody extract them for me), perhaps you can as a Hortonworks insider.

Master Mentor

Awesome! I will try this out and maybe publish an article @cmuchinsky


Expert Contributor

I also tested HiveContext, so that the Hive processing runs in Spark memory. It works.