Created on 05-14-2016 08:53 AM - edited 09-16-2022 03:19 AM
Hi all, my CDH test rig is as follows:
CDH 5.5.1
Spark 1.5.0
Oozie 4.1.0
I have successfully created a simple Oozie Workflow that spawns a Spark Action using the HUE Interface. My intention is to run the Workflow/Action on YARN in cluster mode.
It's a Python script, which is as follows (just a test):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

### (1) ALTERNATIVELY, USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
#sqlCtx = HiveContext(sc)

### (2) IF USING HIVECONTEXT, OPTIONALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx.sql("use default")

### (3) CREATE THE MAIN DATAFRAME. TRY THESE SYNTAXES ALTERNATELY, COMBINED WITH THE DIFFERENT CONTEXTS FROM (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")

### (4) ANOTHER DATAFRAME:
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")

### (5) USELESS PRINT STATEMENT:
print 'a'
When I run the Workflow, a MapReduce Job is started. Shortly after, a Spark Job is spawned (I can see it in the Job Browser).
The Spark Job fails with the following error (excerpt from the log file of the Spark Action):
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql. : java.lang.RuntimeException: Table Not Found: sales_fact
This is my "workflow.xml":
<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5"> <global> <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml> <configuration> <property> <name>oozie.launcher.yarn.app.mapreduce.am.env</name> <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value> </property> </configuration> </global> <start to="spark-3ca0"/> <kill name="Kill"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <action name="spark-3ca0"> <spark xmlns="uri:oozie:spark-action:0.1"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml> <configuration> <property> <name>oozie.use.system.libpath</name> <value>true</value> </property> </configuration> <master>yarn-cluster</master> <mode>cluster</mode> <name>MySpark</name> <class>org.apache.spark.examples.mllib.JavaALS</class> <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar> </spark> <ok to="End"/> <error to="Kill"/> </action> <end name="End"/> </workflow-app>
This is my "job.properties":
oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020
Please note that:
1) I've also uploaded "hive-site.xml" to the same directory as the two files described above. As you can see from "workflow.xml", it should be picked up as well.
2) The "test.py" script is under a "lib" directory in the Workspace created by HUE. It gets picked up. In that directory I also took care of uploading several Jars belonging to some Derby DB Connector, probably required to collect Stats, to avoid other exceptions being throwed.
3) I've tried adding the workflow property "oozie.action.sharelib.for.spark" with the value "hcatalog,hive,hive2", with no success.
4) As you can see in the Python Script above, I've alternately used an SQLContext and a HiveContext object inside the Script. The result is the same (though the error message differs slightly).
5) ShareLib should be OK too:
oozie admin -shareliblist
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
I suspect the Tables Metastore is not being read; that's probably the issue. But I've run out of ideas and I'm not able to get it working... Thanks in advance for any feedback!
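As a side note, here is a minimal sketch of a programmatic workaround I considered (an assumption on my part, not a verified fix): setting the Metastore URI from within the script itself, in case "hive-site.xml" never reaches the YARN container. <METASTORE_HOST> is a placeholder; 9083 is the default Metastore Thrift port.

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Sketch only: point the HiveContext at the Metastore explicitly, before
# the first Hive query is issued. <METASTORE_HOST> is a placeholder.
sconf = SparkConf().setAppName("MySpark").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)
sqlCtx.setConf("hive.metastore.uris", "thrift://<METASTORE_HOST>:9083")

# If the Metastore is reachable, the table should now resolve.
cronologico_DF = sqlCtx.sql("select * from sales_fact")
print cronologico_DF.count()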
Created 05-26-2016 04:05 AM
Update: I got to a working solution. Here is a brief how-to:
JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN):
Spark Master: yarn-cluster
Mode: cluster
App Name: MySpark
Jars/py files: hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class: <WHATEVER_STRING_HERE> (e.g. "clear" or "org.apache.spark.examples.mllib.JavaALS"; we do not have a Main Class in our ".py" script, so any string will do)
Arguments: NO ARGUMENTS DEFINED
WORKFLOW SETTINGS (CLICK GEAR ICON ON TOP RIGHT OF
THE WORKFLOW MAIN SCREEN):
Variables: oozie.use.system.libpath --> true
Workspace: hue-oozie-1463575878.15
Hadoop Properties: oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows: CHECKED
Version: uri.oozie.workflow.0.5
Job XML: EMPTY
SLA Configuration: UNCHECKED
JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN AND THE TRIANGULAR
ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):
- PROPERTIES TAB:
  Options List: --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
  Prepare: NO PREPARE STEPS DEFINED
  Job XML: EMPTY
  Properties: NO PROPERTIES DEFINED
  Retry: NO RETRY OPTIONS DEFINED
- SLA TAB:
  Enabled: UNCHECKED
- CREDENTIALS TAB:
  Credentials: NO CREDENTIALS DEFINED
- TRANSITIONS TAB:
  Ok --> End
  Ko --> Kill
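The "--files" option above is what ships "hive-site.xml" into the YARN container. For reference, a small sketch of mine (not part of the original setup): files distributed with "--files" land in the container's working directory and can be located from PySpark via SparkFiles.

from pyspark import SparkFiles

# Sketch: resolve the local path of a file shipped with --files.
# Assumes the job was submitted with --files .../hive-site.xml.
print SparkFiles.get("hive-site.xml")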
MANUALLY CREATE A MINIMAL "hive-site.xml" FILE TO BE PASSED TO THE SPARK-ON-HIVE
CONTAINER, SO THAT THE TABLES METASTORE CAN BE ACCESSED FROM ANY
NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:
vi hive-site.xml
---
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<THRIFT_HOSTNAME>:9083</value>
    </property>
</configuration>
---
hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15
EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY
IN THE WORKFLOW FOLDER:
vi test.py
---
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)

# Read an existing Metastore table and save a one-column projection as a new table.
xxx_DF = sqlCtx.table("table")
xxx_DF.select("fieldname").write.saveAsTable("new_table")
---
hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib
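To double-check that the write actually reached the Metastore, a quick read-back can be appended to the script (my own sketch, not part of the original "test.py"; it assumes the new table lands in the "default" database):

# Sketch: confirm "new_table" is now registered in the Metastore.
print sqlCtx.tableNames("default")
print sqlCtx.table("new_table").count()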
NOW YOU CAN SUBMIT THE WORKFLOW TO YARN:
- Click the "PLAY" Submit Icon on top of the screen
ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":
<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5"> <global> <configuration> <property> <name>oozie.launcher.yarn.app.mapreduce.am.env</name> <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value> </property> </configuration> </global> <start to="spark-9fa1"/> <kill name="Kill"> <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <action name="spark-9fa1"> <spark xmlns="uri:oozie:spark-action:0.1"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <master>yarn-cluster</master> <mode>cluster</mode> <name>MySpark</name> <class>clear</class> <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar> <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts> </spark> <ok to="End"/> <error to="Kill"/> </action> <end name="End"/> </workflow-app>
ADDITIONAL INFO: AUTO-GENERATED "job.properties":
oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
Created 05-14-2016 03:39 PM
Update: If I use "spark-submit" directly, the script runs successfully (presumably because the local Hive client configuration, including "hive-site.xml", is picked up on the gateway node, whereas the Oozie-launched container does not get it).
Syntax used for "spark-submit":
spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --executor-memory 500M \
    --total-executor-cores 1 \
    hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
    10
Excerpt from output log:
16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
Created 05-26-2016 04:53 AM
Congratulations on solving your issue and thank you for such a detailed description of the solution.
Created 03-29-2019 06:07 AM
Dear All.
I am facing an issue with Oozie while running a simple job from the HUE GUI.
I am getting the error below. Please help me!
Error:
"traceback": [ [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", 112, "get_response", "response = wrapped_callback(request, *callback_args, **callback_kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", 371, "inner", "return func(*args, **kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 113, "decorate", "return view_func(request, *args, **kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 75, "decorate", "return view_func(request, *args, **kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 373, "submit_workflow", "return _submit_workflow_helper(request, workflow, submit_action=reverse('oozie:editor_submit_workflow', kwargs={'doc_id': workflow.id}))" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 428, "_submit_workflow_helper", "'is_oozie_mail_enabled': _is_oozie_mail_enabled(request.user)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 435, "_is_oozie_mail_enabled", "oozie_conf = api.get_configuration()" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/libs/liboozie/src/liboozie/oozie_api.py", 319, "get_configuration", "resp = self._root.get('admin/configuration', params)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 100, "get", "return self.invoke(\"GET\", relpath, params, headers=headers, allow_redirects=True, clear_cookies=clear_cookies)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 80, "invoke", "clear_cookies=clear_cookies)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/http_client.py", 196, "execute", "raise self._exc_class(ex)" ] ] }
Thanks
HadoopHelp
Created on 09-08-2021 03:19 AM - edited 09-08-2021 03:20 AM
Hi,
I am having the same issue on CDP 7.1.6 with Oozie 5.1.0.
But the suggested solution does not seem to work anymore.
Setting
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/</value>
</property>
has no effect.
Is there anything else I can do? Did the setting change?