Created 12-21-2015 03:53 AM
I have created a small Java program for Spark. It works with the "spark-submit" command, and I would like to run it from an Oozie workflow. It seems HDP 2.3 can run Spark jobs from an Oozie workflow, but in Hue's GUI I don't see a Spark job option to include in a workflow. How do I do this?
Created 01-29-2016 08:44 PM
I figured it out by myself. Here are the steps:
1: download the sandbox or use your existing sandbox (HDP 2.3.2)
2: create a workflow in Hue's Oozie editor
3: click "Edit Properties" and add a property under Oozie parameters: oozie.action.sharelib.for.spark = spark,hcatalog,hive
4: click the Save button
5: add a shell action and fill in the Name field. The Shell command field may be required; enter any placeholder string for now and save the shell action. We will come back and edit it later.
6: close the workflow and open the file browser; click oozie, then workspaces, and identify the _hue_xxx directory for the workflow you are creating
7: create a lib directory there
8: copy into lib the jar file that contains your Spark Java program
9: move up one directory and copy in a shell file (e.g. script.sh) that contains:
spark-submit --class JDBCTest spark-test-1.0.jar
spark-test-1.0.jar is the jar file you uploaded to the lib directory.
10: Go back to workflow web page
11: open the shell action and set the Shell command field by selecting the shell file (e.g. script.sh)
12: also populate the Files field, adding the shell file (e.g. script.sh) again
13: click Done
14: save the workflow
15: submit the workflow
16: it should run.
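For reference, the shell action that Hue generates from the steps above should end up looking roughly like the sketch below in workflow.xml. The workflow and action names here are illustrative, not what Hue will actually generate:

```xml
<workflow-app name="spark-shell-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-shell-action"/>
    <action name="spark-shell-action">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- step 11: the shell command -->
            <exec>script.sh</exec>
            <!-- step 12: the same file listed again so it is shipped to the container -->
            <file>script.sh#script.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Listing script.sh both as the exec and as a file (steps 11 and 12) is what makes Oozie copy the script into the action's working directory before running it.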
My Java program does something like this:
// con is a java.sql.Connection obtained through the Hive JDBC driver
Statement stmt = con.createStatement();
String sql = "SELECT s07.description AS job_category, s07.salary, s08.salary, (s08.salary - s07.salary) AS salary_difference FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) WHERE s07.salary < s08.salary SORT BY s08.salary - s07.salary DESC LIMIT 5";
ResultSet res = stmt.executeQuery(sql);
It uses the Hive JDBC driver.
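For completeness, the surrounding connection setup would look roughly like this. The JDBC URL, user, and table are placeholders for your own HiveServer2 endpoint, not values from the original program:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JDBCTest {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; adjust host, port, database, and credentials
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "")) {
            Statement stmt = con.createStatement();
            ResultSet res = stmt.executeQuery("SELECT * FROM sample_07 LIMIT 5");
            while (res.next()) {
                System.out.println(res.getString(1));
            }
        }
    }
}
```

The Hive JDBC driver jar must be on the classpath; running via spark-submit with the sharelib property from step 3 takes care of that.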
Created 02-03-2016 04:40 PM
For your review @Artem Ervits
While I was not able to see the contents of this link directly (I had to have somebody extract it for me), perhaps you can as a Hortonworks insider.
Created 02-03-2016 04:45 PM
awesome! I will try this out and maybe publish an article @cmuchinsky
Created 01-29-2016 10:41 PM
I also tested HiveContext, so that the Hive processing runs in Spark's own memory rather than through JDBC. It works.
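The HiveContext variant looked roughly like this, using the Spark 1.x API shipped with HDP 2.3.2. The class name is just an example; the query reuses the sample_07/sample_08 tables from the JDBC version:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class HiveContextTest {
    public static void main(String[] args) {
        // Runs the same Hive query, but executed by Spark's engine
        SparkConf conf = new SparkConf().setAppName("HiveContextTest");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hc = new HiveContext(sc.sc());

        DataFrame result = hc.sql(
            "SELECT s07.description AS job_category, "
          + "(s08.salary - s07.salary) AS salary_difference "
          + "FROM sample_07 s07 JOIN sample_08 s08 ON (s07.code = s08.code) "
          + "WHERE s07.salary < s08.salary "
          + "ORDER BY salary_difference DESC LIMIT 5");
        result.show();

        sc.stop();
    }
}
```

Submitted the same way as the JDBC version (spark-submit inside the shell action), with the hive sharelib entry from step 3 providing the Hive metastore access.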