
Oozie workflow, Spark action (using a simple DataFrame): "Table not found" error

Super Collaborator

Hi all, my CDH test rig is as follows:

 

CDH 5.5.1

Spark 1.5.0

Oozie 4.1.0

 

I have successfully created a simple Oozie Workflow that spawns a Spark Action, using the HUE interface. My intention is to run the Workflow/Action on YARN in cluster mode.

 

The action is a Python script (just a test), shown below:

 

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
#sqlCtx = HiveContext(sc)

### (2) IF USING HIVECONTEXT, OPTIONALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx.sql("use default")

### (3) CREATE MAIN DATAFRAME. TRY THESE SYNTAXES ALTERNATELY, COMBINED WITH THE DIFFERENT CONTEXTS FROM (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")

### (4) ANOTHER DATAFRAME
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")

### (5) USELESS PRINT STATEMENT:
print 'a'

 

When I run the Workflow, a MapReduce Job (the Oozie launcher) starts. Shortly after, a Spark Job is spawned (I can see this in the Job Browser).

 

The Spark Job fails with the following error (excerpt from the log file of the Spark Action):

 

py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.RuntimeException: Table Not Found: sales_fact

 

This is my "workflow.xml":

 

<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
  <global>
      <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                    <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
                </property>
            </configuration>
  </global>
    <start to="spark-3ca0"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-3ca0">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
              <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>MySpark</name>
              <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

 

This is my "job.properties":

 

oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020

Please note that:

 

1) I've also uploaded "hive-site.xml" to the same directory as the two files shown above. As you can see from "workflow.xml", it should be picked up.

 

2) The "test.py" script is under a "lib" directory in the Workspace created by HUE, and it gets picked up. In the same directory I've also uploaded several JARs for a Derby DB connector, apparently required to collect stats, to avoid other exceptions being thrown.

 

3) I've tried adding the workflow property "oozie.action.sharelib.for.spark" with the value "hcatalog,hive,hive2", with no success.

 

4) As you can see in the Python script above, I've alternately used an SQLContext and a HiveContext object in the script. The result is the same, though the error message differs slightly.

 

5) ShareLib should be OK too:

 

oozie admin -shareliblist

[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig

 

I suspect the Tables Metastore is not being reached; that's probably the issue. But I've run out of ideas and can't get it working... Thanks in advance for any feedback!
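P.S. For anyone debugging the same thing: a minimal diagnostic sketch along these lines (assuming the same HiveContext setup as in the script above; the app name is arbitrary) can confirm whether the Metastore is actually reachable from the container. Run it through the same Workflow and check the action's log:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf().setAppName("MetastoreCheck"))
sqlCtx = HiveContext(sc)

# If the remote Metastore is reachable, this prints the real table list;
# an empty list suggests Spark silently created a fresh local Derby metastore.
print sqlCtx.tableNames("default")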


1 ACCEPTED SOLUTION

Super Collaborator

Update: I arrived at a working solution. Here is a brief how-to:

 

 

JOB MAIN BOX CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN):

Spark Master:			yarn-cluster
Mode:				cluster
App Name:			MySpark
Jars/py files:			hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py
Main Class:			<WHATEVER_STRING_HERE>  (e.g. "clear" or "org.apache.spark.examples.mllib.JavaALS"; our ".py" script has no Main Class, so any placeholder string will do)
Arguments:			NO ARGUMENTS DEFINED


 

WORKFLOW SETTINGS (CLICK THE GEAR ICON ON TOP RIGHT OF THE WORKFLOW MAIN SCREEN):

Variables:			oozie.use.system.libpath --> true
Workspace:			hue-oozie-1463575878.15
Hadoop Properties:		oozie.launcher.yarn.app.mapreduce.am.env --> SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
Show Graph Arrows:		CHECKED
Version:			uri:oozie:workflow:0.5
Job XML:			EMPTY
SLA Configuration:		UNCHECKED

 

JOB DETAILED CONFIGURATION (CLICK THE "PENCIL" EDIT ICON ON TOP OF THE WORKFLOW MAIN SCREEN, THEN THE TRIANGULAR ICON ON TOP RIGHT OF THE MAIN JOB BOX TO EDIT IT IN DETAIL):

- PROPERTIES TAB:
-----------------
Options List:			--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml
Prepare:			NO PREPARE STEPS DEFINED
Job XML:			EMPTY
Properties:			NO PROPERTIES DEFINED
Retry:				NO RETRY OPTIONS DEFINED

- SLA TAB:
----------
Enabled:			UNCHECKED

- CREDENTIALS TAB:
------------------
Credentials:			NO CREDENTIALS DEFINED

- TRANSITIONS TAB:
------------------
Ok				End
Ko				Kill

 

 

MANUALLY EDIT A MINIMAL "hive-site.xml" TO BE PASSED TO THE SPARK-ON-HIVE CONTAINER, SO THAT THE TABLES METASTORE CAN BE ACCESSED FROM ANY NODE IN THE CLUSTER, AND UPLOAD IT TO HDFS:

vi hive-site.xml

---
<configuration>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://<THRIFT_HOSTNAME>:9083</value>
	</property>
</configuration>
---

hdfs dfs -put hive-site.xml /user/hue/oozie/workspaces/hue-oozie-1463575878.15
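Optionally, sanity-check that the file landed where the action expects it:

hdfs dfs -cat /user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml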

 

 

EDIT THE PYSPARK SCRIPT AND UPLOAD IT INTO THE "lib" DIRECTORY IN THE WORKFLOW FOLDER:

vi test.py

---
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

sqlCtx = HiveContext(sc)

xxx_DF = sqlCtx.table("table")
# saveAsTable() returns None, so don't assign its result; use the writer API:
xxx_DF.select("fieldname").write.saveAsTable("new_table")
---

hdfs dfs -put test.py /user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib
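Note: an alternative that should in principle also work (I have not verified it on Spark 1.5) is to skip shipping "hive-site.xml" and instead point the Hive client at the remote Metastore from inside the script, before the first query runs:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf().setAppName("MySpark"))
sqlCtx = HiveContext(sc)
# Untested alternative to the --files approach above; same placeholder hostname:
sqlCtx.setConf("hive.metastore.uris", "thrift://<THRIFT_HOSTNAME>:9083")
xxx_DF = sqlCtx.table("table")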

 

NOW YOU CAN SUBMIT THE WORKFLOW TO YARN:

- Click the "PLAY" Submit Icon on top of the screen

 

ADDITIONAL INFO: AUTO-GENERATED "workflow.xml":

<workflow-app name="Spark_on_Oozie" xmlns="uri:oozie:workflow:0.5">
  <global>
            <configuration>
                <property>
                    <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                    <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
                </property>
            </configuration>
  </global>
    <start to="spark-9fa1"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-9fa1">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>MySpark</name>
              <class>clear</class>
            <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/lib/test.py</jar>
              <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml</spark-opts>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

 

ADDITIONAL INFO: AUTO-GENERATED "job.properties":

oozie.use.system.libpath=true
security_enabled=False
dryrun=False
jobTracker=<JOBTRACKER_HOSTNAME>:8032
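For completeness, the same workflow can presumably also be submitted from the command line with the standard Oozie CLI (the Oozie server URL below is a placeholder):

oozie job -oozie http://<OOZIE_SERVER>:11000/oozie -config job.properties -run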


Super Collaborator

Update: If I use "spark-submit", the script runs successfully.

 

 

Syntax used for "spark-submit":

 

spark-submit \
  --master yarn-cluster \
  --deploy-mode cluster \
  --executor-memory 500M \
  --total-executor-cores 1 \
  hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
  10

 

 

Excerpt from output log:

 

16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
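Side note: "spark-submit" presumably succeeds because it picks up "hive-site.xml" from the Hive/Spark client configuration on the gateway node, which an Oozie launcher container does not have. The explicit command-line equivalent of shipping the file would be something like this (standard CDH config path assumed):

spark-submit \
  --master yarn-cluster \
  --files /etc/hive/conf/hive-site.xml \
  hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py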

Community Manager

Congratulations on solving your issue and thank you for such a detailed description of the solution.


Cy Jervis, Manager, Community Program

Contributor

Dear all,

I am facing an issue with Oozie while running a simple job from the HUE GUI.

I am getting the error below. Please help me!

 

Error:

"traceback": [
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", 112, "get_response", "response = wrapped_callback(request, *callback_args, **callback_kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", 371, "inner", "return func(*args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 113, "decorate", "return view_func(request, *args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 75, "decorate", "return view_func(request, *args, **kwargs)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 373, "submit_workflow", "return _submit_workflow_helper(request, workflow, submit_action=reverse('oozie:editor_submit_workflow', kwargs={'doc_id': workflow.id}))" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 428, "_submit_workflow_helper", "'is_oozie_mail_enabled': _is_oozie_mail_enabled(request.user)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 435, "_is_oozie_mail_enabled", "oozie_conf = api.get_configuration()" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/libs/liboozie/src/liboozie/oozie_api.py", 319, "get_configuration", "resp = self._root.get('admin/configuration', params)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 100, "get", "return self.invoke(\"GET\", relpath, params, headers=headers, allow_redirects=True, clear_cookies=clear_cookies)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 80, "invoke", "clear_cookies=clear_cookies)" ],
  [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/http_client.py", 196, "execute", "raise self._exc_class(ex)" ]
] }


Thanks

HadoopHelp

avatar
Contributor

Hi, 

I am having the same issue on CDP 7.1.6 with Oozie 5.1.0.

But the suggested solution does not seem to work anymore. 

 

Setting

                <property>
                    <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                    <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/</value>
                </property>

has no effect.

 

Is there anything else I can do? Did the setting change?