Created 05-26-2016 04:05 AM
Update: I got to a working solution; here is a brief howto to reproduce the result:
JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN):
Spark Master: yarn-cluster
Mode: cluster
App Name: MySpark
Jars/py files: hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class: <WHATEVER_STRING_HERE> (e.g. "clear" or "org.apache.spark.examples.mllib.JavaALS" -- our ".py" script has no Main Class, so any string will do)
Arguments: NO ARGUMENTS DEFINED
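The fields above correspond roughly to an ordinary spark-submit invocation. A minimal Python sketch of that mapping (the dict keys mirror the form labels; the assembled command is illustrative only -- it is not what Hue/Oozie builds internally):

```python
# Sketch: how the Hue Spark-action form fields roughly map onto
# spark-submit arguments. Illustrative only; Oozie assembles the
# real launcher command itself.
fields = {
    "Spark Master": "yarn-cluster",
    "Mode": "cluster",
    "App Name": "MySpark",
    "Jars/py files": "hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py",
    "Main Class": "clear",  # ignored for a .py file, but the form needs a value
}

cmd = ["spark-submit",
       "--master", fields["Spark Master"],
       "--deploy-mode", fields["Mode"],
       "--name", fields["App Name"],
       fields["Jars/py files"]]  # a .py script: no --class flag is emitted

print(" ".join(cmd))
```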
WORKFLOW SETTINGS (CLICK GEAR ICON ON TOP RIGHT OF
THE WORKFLOW MAIN SCREEN):
Variables: oozie.use.system.libpath --> true
Workspace: hue-oozie-1463575878.15
Hadoop Properties: oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows: CHECKED
Version: uri:oozie:workflow:0.5
Job XML: EMPTY
SLA Configuration: UNCHECKED
JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN AND THE TRIANGULAR
ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):
- PROPERTIES TAB:
  Options List: --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
  Prepare: NO PREPARE STEPS DEFINED
  Job XML: EMPTY
  Properties: NO PROPERTIES DEFINED
  Retry: NO RETRY OPTIONS DEFINED
- SLA TAB:
  Enabled: UNCHECKED
- CREDENTIALS TAB:
  Credentials: NO CREDENTIALS DEFINED
- TRANSITIONS TAB:
  Ok --> End
  Ko --> Kill
MANUALLY EDIT A MINIMAL "hive-site.xml" FILE TO BE PASSED TO THE SPARK-ON-HIVE
CONTAINER, SO THAT THE HIVE METASTORE CAN BE REACHED FROM ANY
NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:
vi hive-site.xml
---
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://<THRIFT_HOSTNAME>:9083</value>
  </property>
</configuration>
---
hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15
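Before uploading, it can be worth checking locally that the file parses and actually carries the metastore URI. A small stdlib-only sketch (the thrift hostname below is a placeholder, not a real host):

```python
# Parse the minimal hive-site.xml with the stdlib and pull out
# hive.metastore.uris -- a quick local sanity check before `hdfs dfs -put`.
import xml.etree.ElementTree as ET

hive_site = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>"""

root = ET.fromstring(hive_site)
uris = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
print(uris["hive.metastore.uris"])
```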
EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY
IN THE WORKFLOW FOLDER:
vi test.py
---
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Build the context; HiveContext picks up the metastore URI from the
# hive-site.xml shipped via --files.
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)

# Read a Hive table and persist a projection of it as a new table.
xxx_DF = sqlCtx.table("table")
xxx_DF.select("fieldname").write.saveAsTable("new_table")
---
hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib
NOW YOU CAN SUBMIT THE WORKFLOW IN YARN:
- Click the "PLAY" Submit Icon on top of the screen
ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":
<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
  <global>
    <configuration>
      <property>
        <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
        <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
      </property>
    </configuration>
  </global>
  <start to="spark-9fa1"/>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <action name="spark-9fa1">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn-cluster</master>
      <mode>cluster</mode>
      <name>MySpark</name>
      <class>clear</class>
      <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
      <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <end name="End"/>
</workflow-app>
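If you hand-edit the generated file, one way to spot-check the start/ok/error wiring is to parse it with the stdlib. A sketch against the workflow above (the XML is inlined as a trimmed skeleton so the snippet is self-contained; the spark action body is abbreviated):

```python
# Spot-check the node wiring of workflow.xml: the start node and the
# ok/error transitions of the spark action. Skeleton of the file above,
# action body trimmed for brevity.
import xml.etree.ElementTree as ET

workflow = """<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-9fa1"/>
  <kill name="Kill">
    <message>Action failed</message>
  </kill>
  <action name="spark-9fa1">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <master>yarn-cluster</master>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <end name="End"/>
</workflow-app>"""

NS = "{uri:oozie:workflow:0.5}"  # elements live in the workflow namespace
root = ET.fromstring(workflow)
start_to = root.find(f"{NS}start").get("to")
action = root.find(f"{NS}action")
ok_to = action.find(f"{NS}ok").get("to")
error_to = action.find(f"{NS}error").get("to")
print(start_to, ok_to, error_to)
```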
ADDITIONAL INFO: AUTO-GENERATED "job.properties":
oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
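job.properties is plain key=value lines, so it can be read back without any extra dependencies. A quick sketch (the jobtracker hostname below is a placeholder):

```python
# Parse the key=value lines of job.properties with plain string splitting.
# The hostname is a placeholder standing in for <JOBTRACKER_HOSTNAME>.
props_text = """oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=jobtracker.example.com:8032"""

props = dict(line.split("=", 1) for line in props_text.splitlines() if "=" in line)
print(props["jobTracker"])
```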