Update: I got to a working solution. Here is a brief how-to for reproducing it:

 

 

JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN):

Spark Master:			yarn-cluster
Mode:				cluster
App Name:			MySpark
Jars/py files:			hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class:			<ANY_STRING_HERE>  (e.g. "clear" or "org.apache.spark.examples.mllib.JavaALS"); any placeholder will do, because a ".py" script has no main class.
Arguments:			NO ARGUMENTS DEFINED


 

WORKFLOW SETTINGS (CLICK THE GEAR ICON ON THE TOP RIGHT OF
THE WORKFLOW MAIN SCREEN):

Variables:			oozie.use.system.libpath --> true
Workspace:			hue-oozie-1463575878.15
Hadoop Properties:		oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows:		CHECKED
Version:			uri:oozie:workflow:0.5
Job XML:			EMPTY
SLA Configuration:		UNCHECKED

 

JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN AND THEN THE TRIANGULAR
ICON ON THE TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):

- PROPERTIES TAB:
-----------------
Options List:			--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
Prepare:			NO PREPARE STEPS DEFINED
Job XML:			EMPTY
Properties:			NO PROPERTIES DEFINED
Retry:				NO RETRY OPTIONS DEFINED

- SLA TAB:
----------
Enabled:			UNCHECKED

- CREDENTIALS TAB:
------------------
Credentials:			NO CREDENTIALS DEFINED

- TRANSITIONS TAB:
------------------
Ok				End
Ko				Kill

 

 

MANUALLY CREATE A MINIMAL "hive-site.xml" FILE TO PASS TO THE SPARK
CONTAINER SO THAT IT CAN REACH THE HIVE METASTORE FROM ANY
NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:

vi hive-site.xml

---
<configuration>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://<THRIFT_HOSTNAME>:9083</value>
	</property>
</configuration>
---

hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15
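
The same minimal file can also be generated from a script instead of being typed by hand. The sketch below is only an illustration using Python's standard library: the helper name build_hive_site and the host metastore.example.com are made up for the example, and 9083 is simply the port used above.

```python
import xml.etree.ElementTree as ET


def build_hive_site(thrift_host, port=9083):
    """Return a minimal hive-site.xml as a string (hypothetical helper)."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = "thrift://%s:%d" % (thrift_host, port)
    return ET.tostring(configuration, encoding="unicode")


# metastore.example.com is a placeholder, not a real host.
print(build_hive_site("metastore.example.com"))
```

Save the printed text as hive-site.xml and upload it with the hdfs dfs -put command shown above.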

 

 

EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY
IN THE WORKFLOW FOLDER:

vi test.py

---
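from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Application name and master match the Spark action configured above.
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

# HiveContext uses the hive-site.xml shipped via --files to find the metastore.
sqlCtx = HiveContext(sc)

# Read an existing Hive table and save one column of it as a new table.
# saveAsTable() returns nothing, so its result is not assigned.
# (From Spark 1.4 on, the non-deprecated form is df.write.saveAsTable(...).)
xxx_DF = sqlCtx.table("table")
xxx_DF.select("fieldname").saveAsTable("new_table")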
---

hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib
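
Every HDFS path in this setup derives from the workspace name. As a small sketch (the helper name workspace_paths is made up, and the base directory is simply the one used in this post), the values can be computed in one place so that the Hue fields and the upload commands stay in sync:

```python
import posixpath

BASE_DIR = "/user/hue/oozie/workspaces"  # base directory used in this post


def workspace_paths(workspace, script="test.py"):
    """Return the HDFS locations used by the workflow (hypothetical helper)."""
    root = posixpath.join(BASE_DIR, workspace)
    lib_dir = posixpath.join(root, "lib")
    return {
        "workspace_dir": root,  # target of the hive-site.xml upload
        "lib_dir": lib_dir,     # target of the test.py upload
        # Value for the "Jars/py files" field.
        "script_uri": "hdfs://" + posixpath.join(lib_dir, script),
        # Value for the "Options List" field.
        "options": "--files hdfs://" + posixpath.join(root, "hive-site.xml"),
    }


paths = workspace_paths("hue-oozie-1463575878.15")
for key in sorted(paths):
    print(key, "=", paths[key])
```

If the workspace name changes, only the argument to workspace_paths needs to be updated.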

 

NOW YOU CAN SUBMIT THE WORKFLOW TO YARN:

- Click the "PLAY" Submit Icon on top of the screen

 

ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":

<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
    <global>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
            </property>
        </configuration>
    </global>
    <start to="spark-9fa1"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-9fa1">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>MySpark</name>
            <class>clear</class>
            <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
            <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
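
Before submitting, it can help to check that the generated workflow.xml really contains the settings described in this post. The following is a rough sketch using Python's standard library. The helper name check_spark_actions is made up, the sample is a trimmed copy of the XML above, and the three checks reflect only the choices made in this how-to, not requirements imposed by Oozie.

```python
import xml.etree.ElementTree as ET

# Namespaces copied from the workflow.xml above.
NS = {"wf": "uri:oozie:workflow:0.5", "sp": "uri:oozie:spark-action:0.1"}


def check_spark_actions(workflow_xml):
    """Return a list of problems found in the Spark actions (hypothetical helper)."""
    problems = []
    root = ET.fromstring(workflow_xml)
    for action in root.findall("wf:action", NS):
        spark = action.find("sp:spark", NS)
        if spark is None:
            continue  # not a Spark action
        name = action.get("name")
        master = spark.findtext("sp:master", default="", namespaces=NS)
        jar = spark.findtext("sp:jar", default="", namespaces=NS)
        opts = spark.findtext("sp:spark-opts", default="", namespaces=NS)
        if master != "yarn-cluster":
            problems.append("%s: master is %r" % (name, master))
        if not jar.endswith(".py"):
            problems.append("%s: jar is not a .py file" % name)
        if "hive-site.xml" not in opts:
            problems.append("%s: hive-site.xml not passed via --files" % name)
    return problems


# Trimmed copy of the workflow above, kept only for illustration.
SAMPLE = """<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-9fa1"/>
  <action name="spark-9fa1">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <master>yarn-cluster</master>
      <mode>cluster</mode>
      <name>MySpark</name>
      <class>clear</class>
      <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
      <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <end name="End"/>
</workflow-app>"""

print(check_spark_actions(SAMPLE))
```

An empty list means that every Spark action in the file matches the configuration from this post.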

 

ADDITIONAL INFO: AUTO-GENERATED "job.properties":

oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032

 

 

 

 
