Oozie workflow, Spark action (using simple Dataframe): "Table not found" error
- Labels: Apache Oozie
Created on ‎05-14-2016 08:53 AM - edited ‎09-16-2022 03:19 AM
Hi all, my CDH test rig is as follows:
CDH 5.5.1
Spark 1.5.0
Oozie 4.1.0
I have successfully created a simple Oozie workflow that spawns a Spark action, using the HUE interface. My intention is to run the workflow/action on YARN in cluster mode.
The action runs a Python script, which is as follows (just a test):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
#sqlCtx = HiveContext(sc)

### (2) IF HIVECONTEXT, EVENTUALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx.sql("use default")

### (3) CREATE MAIN DATAFRAME. TRY THE SYNTAXES ALTERNATIVELY, COMBINE WITH DIFFERENT (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")

### (4) ANOTHER DATAFRAME:
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")

### (5) USELESS PRINT STATEMENT:
print 'a'
When I run the workflow, a MapReduce job is started. Shortly after, a Spark job is spawned (I can see that from the Job Browser).
The Spark job fails with the following error (excerpt from the log file of the Spark action):
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql. : java.lang.RuntimeException: Table Not Found: sales_fact
This is my "workflow.xml":
<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
    <global>
        <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
            </property>
        </configuration>
    </global>
    <start to="spark-3ca0"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-3ca0">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>MySpark</name>
            <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
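As a side note, the structure of a workflow like this can be sanity-checked outside Oozie with a few lines of stdlib Python, e.g. to confirm that the Spark action really carries a `<job-xml>` entry pointing at "hive-site.xml". The snippet below is only an illustrative sketch, embedding a trimmed-down copy of the workflow above; it is not part of the job:

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the workflow above (same namespaces as the original).
WORKFLOW = """<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-3ca0"/>
  <action name="spark-3ca0">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
      <master>yarn-cluster</master>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <end name="End"/>
</workflow-app>"""

# The workflow and the spark action live in different XML namespaces.
NS = {"wf": "uri:oozie:workflow:0.5", "sp": "uri:oozie:spark-action:0.1"}

root = ET.fromstring(WORKFLOW)
job_xmls = [e.text for e in root.findall(".//sp:spark/sp:job-xml", NS)]
print(job_xmls)  # should list the hive-site.xml path
assert any(p.endswith("hive-site.xml") for p in job_xmls)
```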
This is my "job.properties":
oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020
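"job.properties" is a flat key=value file. To double-check what would get submitted, here is a minimal parser sketch (stdlib only; the `<MY_SERVER_FQDN_HERE>` placeholders are kept exactly as in the file above, and this simplified parser deliberately ignores java.util.Properties escapes and line continuations):

```python
# Simplified key=value parser for an Oozie job.properties file.
# Placeholder hostnames are preserved as in the original post.
JOB_PROPERTIES = """\
oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020
"""

def parse_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

props = parse_properties(JOB_PROPERTIES)
print(props["nameNode"])  # hdfs://<MY_SERVER_FQDN_HERE>:8020
```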
Please note that:
1) I've also uploaded "hive-site.xml" to the same directory as the two files described above. As you can see from "workflow.xml", it should be picked up.
2) The "test.py" script is under a "lib" directory in the workspace created by HUE, and it gets picked up. In that directory I also uploaded several JARs belonging to a Derby DB connector, probably required to collect stats, to avoid other exceptions being thrown.
3) I've tried adding the workflow property "oozie.action.sharelib.for.spark" with the value "hcatalog,hive,hive2", with no success.
4) As you can see in the Python script above, I've alternately used an SQLContext or a HiveContext object inside the script. The results are the same (though the error message differs slightly).
5) The ShareLib should be OK too:
oozie admin -shareliblist
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
I suspect the table Metastore is not being read; that's probably the issue. But I've run out of ideas and can't get it working. Thanks in advance for any feedback!
Created ‎05-26-2016 04:05 AM
Update: I got to a working solution. Here is a brief how-to:
JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN):
Spark Master: yarn-cluster
Mode: cluster
App Name: MySpark
Jars/py files: hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class: <WHATEVER_STRING_HERE> (e.g. "clear" or "org.apache.spark.examples.mllib.JavaALS"; we do not have a Main Class in our ".py" script!)
Arguments: NO ARGUMENTS DEFINED
WORKFLOW SETTINGS (CLICK GEAR ICON ON TOP RIGHT OF
THE WORKFLOW MAIN SCREEN):
Variables: oozie.use.system.libpath --> true
Workspace: hue-oozie-1463575878.15
Hadoop Properties: oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows: CHECKED
Version: uri.oozie.workflow.0.5
Job XML: EMPTY
SLA Configuration: UNCHECKED
JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON
ON TOP OF THE WORKFLOW MAIN SCREEN AND THE TRIANGULAR
ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):
- PROPERTIES TAB:
-----------------
Options List: --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
Prepare: NO PREPARE STEPS DEFINED
Job XML: EMPTY
Properties: NO PROPERTIES DEFINED
Retry: NO RETRY OPTIONS DEFINED

- SLA TAB:
----------
Enabled: UNCHECKED

- CREDENTIALS TAB:
------------------
Credentials: NO CREDENTIALS DEFINED

- TRANSITIONS TAB:
------------------
Ok --> End
Ko --> Kill
MANUALLY CREATE A MINIMAL "hive-site.xml" FILE TO BE PASSED TO THE SPARK-ON-HIVE
CONTAINER, SO THAT THE TABLES METASTORE CAN BE ACCESSED FROM ANY
NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:
vi hive-site.xml
---
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<THRIFT_HOSTNAME>:9083</value>
    </property>
</configuration>
---
hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15
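Before uploading, it can be worth verifying that the minimal "hive-site.xml" is well-formed and actually carries "hive.metastore.uris". Below is a small stdlib Python sketch; the hostname "metastore-host.example.com" is a hypothetical stand-in for the real Thrift host, since a literal "<THRIFT_HOSTNAME>" placeholder would not be valid XML:

```python
import xml.etree.ElementTree as ET

# Minimal hive-site.xml from the step above; "metastore-host.example.com"
# is a hypothetical stand-in for the real metastore host.
HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>"""

# Parsing fails loudly here if the file is not well-formed XML.
root = ET.fromstring(HIVE_SITE)
conf = {p.findtext("name"): p.findtext("value") for p in root.findall("property")}
print(conf["hive.metastore.uris"])  # thrift://metastore-host.example.com:9083
assert conf["hive.metastore.uris"].startswith("thrift://")
```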
EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY
IN THE WORKFLOW FOLDER:
vi test.py
---
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)
xxx_DF = sqlCtx.table("table")
yyy_DF = xxx_DF.select("fieldname").saveAsTable("new_table")
---
hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib
NOW YOU CAN SUBMIT THE WORKFLOW TO YARN:
- Click the "PLAY" Submit Icon on top of the screen
ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":
<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
    <global>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
            </property>
        </configuration>
    </global>
    <start to="spark-9fa1"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-9fa1">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>MySpark</name>
            <class>clear</class>
            <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
            <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
ADDITIONAL INFO: AUTO-GENERATED "job.properties":
oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
Created ‎05-14-2016 03:39 PM
Update: If I use "spark-submit", the script runs successfully.
Syntax used for "spark-submit":
spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --executor-memory 500M \
    --total-executor-cores 1 \
    hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
    10
Excerpt from output log:
16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
Created ‎05-26-2016 04:53 AM
Congratulations on solving your issue and thank you for such a detailed description of the solution.
Cy Jervis, Manager, Community Program
Created ‎03-29-2019 06:07 AM
Dear All,
I am facing an issue with Oozie while running a simple job from the HUE GUI, and I am getting the error below. Please help!
Error:
"traceback": [
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", 112, "get_response", "response = wrapped_callback(request, *callback_args, **callback_kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", 371, "inner", "return func(*args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 113, "decorate", "return view_func(request, *args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 75, "decorate", "return view_func(request, *args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 373, "submit_workflow", "return _submit_workflow_helper(request, workflow, submit_action=reverse('oozie:editor_submit_workflow', kwargs={'doc_id': workflow.id}))" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 428, "_submit_workflow_helper", "'is_oozie_mail_enabled': _is_oozie_mail_enabled(request.user)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 435, "_is_oozie_mail_enabled", "oozie_conf = api.get_configuration()" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/libs/liboozie/src/liboozie/oozie_api.py", 319, "get_configuration", "resp = self._root.get('admin/configuration', params)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 100, "get", "return self.invoke(\"GET\", relpath, params, headers=headers, allow_redirects=True, clear_cookies=clear_cookies)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 80, "invoke", "clear_cookies=clear_cookies)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/http_client.py", 196, "execute", "raise self._exc_class(ex)" ]
] }
Thanks
HadoopHelp
Created on ‎09-08-2021 03:19 AM - edited ‎09-08-2021 03:20 AM
Hi,
I am having the same issue on CDP 7.1.6 with Oozie 5.1.0.
But the suggested solution does not seem to work anymore.
Setting
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/</value>
</property>
has no effect.
Is there anything else I can do? Did the setting change?
