Oozie workflow, Spark action (using simple Dataframe): "Table not found" error

Expert Contributor

Hi all, my CDH test rig is as follows:

 

CDH 5.5.1

Spark 1.5.0

Oozie 4.1.0

 

I have successfully created a simple Oozie workflow that runs a Spark action, built through the Hue interface. My intention is to run the workflow/action on YARN in cluster mode.

 

It's a Python script, which is as follows (just a test):

 

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *

sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)

### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
#sqlCtx = HiveContext(sc)

### (2) IF HIVECONTEXT, EVENTUALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx.sql("use default")

### (3) CREATE MAIN DATAFRAME. TRY THE SYNTAXES ALTERNATIVELY, COMBINE WITH DIFFERENT (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")

### (4) ANOTHER DATAFRAME
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")

### (5) USELESS PRINT STATEMENT:
print 'a'

 

When I run the workflow, a MapReduce job (the Oozie launcher) is started. Shortly afterwards, a Spark job is spawned (I can see that from the Job Browser).

 

The Spark job fails with the following error (excerpt from the Spark action's log file):

 

py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.RuntimeException: Table Not Found: sales_fact

 

This is my "workflow.xml":

 

<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
    <global>
        <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
        <configuration>
            <property>
                <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
            </property>
        </configuration>
    </global>
    <start to="spark-3ca0"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-3ca0">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>yarn-cluster</master>
            <mode>cluster</mode>
            <name>MySpark</name>
            <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
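
One variant worth trying (a sketch based on my reading of the Spark action, not something verified on this cluster): <job-xml> configures the launcher's Hadoop job, but in yarn-cluster mode it does not necessarily put hive-site.xml on the classpath of the Spark driver itself, so shipping the file explicitly through <spark-opts> is a commonly suggested workaround:

<!-- Hypothetical variant of the action above: ship hive-site.xml with the
     Spark application so the driver can locate the Hive metastore. -->
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn-cluster</master>
    <mode>cluster</mode>
    <name>MySpark</name>
    <jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
    <spark-opts>--files hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</spark-opts>
</spark>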

 

This is my "job.properties":

 

oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020

Please note that:

 

1) I've also uploaded "hive-site.xml" to the same directory as the two files described above. As you can see from "workflow.xml", it should be picked up as well.

 

2) The "test.py" script is under a "lib" directory in the workspace created by Hue, and it gets picked up. In that directory I also uploaded several JARs belonging to a Derby DB connector, apparently required for collecting statistics, to avoid other exceptions being thrown.

 

3) I've tried adding the workflow property "oozie.action.sharelib.for.spark" with the value "hcatalog,hive,hive2", with no success (see the job.properties note after this list).

 

4) As you can see in the Python script above, I've alternately used an SQLContext or a HiveContext object inside the script. The results are the same, though the error message differs slightly (see the HiveContext sketch after this list).

 

5) ShareLib should be OK too:

 

oozie admin -shareliblist

[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig
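
Regarding note 3 above, one detail worth double-checking (this is my understanding of Oozie's override semantics, not something verified here): "oozie.action.sharelib.for.spark" replaces the default sharelib selection for the action rather than extending it, so the "spark" library itself should normally stay in the list. A hypothetical job.properties entry:

# Keep "spark" in the list: the property overrides the default sharelib
# selection rather than adding to it.
oozie.action.sharelib.for.spark=spark,hive,hcatalog,hive2

Regarding note 4, in Spark 1.5 a plain SQLContext has no access to the Hive metastore at all, so only the HiveContext variant can ever resolve "sales_fact", and even then only if hive-site.xml is visible to the driver. A minimal sketch of that variant (assuming the metastore becomes reachable):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# The master can be left to the Oozie action's <master> element instead of
# hard-coding it in the script.
sconf = SparkConf().setAppName("MySpark")
sc = SparkContext(conf=sconf)

# HiveContext reads hive-site.xml from the classpath to locate the metastore;
# without it, Spark falls back to a local embedded Derby metastore that does
# not contain the Hive tables.
sqlCtx = HiveContext(sc)
cronologico_DF = sqlCtx.sql("select * from sales_fact")
cronologico_DF.select("PRODUCT_KEY").show()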

 

I suspect the Hive metastore is not being reached; that's probably the issue. But I've run out of ideas and haven't been able to get it working... Thanks in advance for any feedback!
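
A quick way to test that suspicion from inside the failing job (a sketch; tableNames() and getConf() are available on both context types in Spark 1.5) would be:

# Diagnostic snippet: log which tables and which metastore the context sees.
# With a plain SQLContext, or a HiveContext that has silently fallen back to
# an embedded Derby metastore, tableNames() will not include "sales_fact".
print sqlCtx.tableNames()
print sqlCtx.getConf("hive.metastore.uris", "<not set>")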

1 ACCEPTED SOLUTION

Expert Contributor

This problem has been solved!
5 REPLIES

Expert Contributor

Update: If I use "spark-submit", the script runs successfully.


Syntax used for "spark-submit":

 

spark-submit \
  --master yarn-cluster \
  --deploy-mode cluster \
  --executor-memory 500M \
  --total-executor-cores 1 \
  hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
  10


Excerpt from output log:

 

16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
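
This fits the metastore theory: spark-submit on the edge node picks up the full Spark client configuration, including the hive-site.xml deployed under the Spark conf directory, while the container spawned by the Oozie launcher apparently does not. On a typical CDH layout (treat the path as an assumption) this can be confirmed with:

# Check that the Spark client config spark-submit uses actually contains
# hive-site.xml; this is what the Oozie-launched run is likely missing.
ls -l /etc/spark/conf/hive-site.xml

(As a side note, --total-executor-cores applies to standalone and Mesos masters, not YARN, which doesn't affect the result here.)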

Expert Contributor

This problem has been solved!

Community Manager

Congratulations on solving your issue and thank you for such a detailed description of the solution.


Cy Jervis, Manager, Community Program

Contributor

Dear all,

I am facing an issue with Oozie while running a simple job from the Hue GUI.

I am getting the error below. Please help me!

Error:

 "traceback": [ [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", 112, "get_response", "response = wrapped_callback(request, *callback_args, **callback_kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", 371, "inner", "return func(*args, **kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 113, "decorate", "return view_func(request, *args, **kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/decorators.py", 75, "decorate", "return view_func(request, *args, **kwargs)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 373, "submit_workflow", "return _submit_workflow_helper(request, workflow, submit_action=reverse('oozie:editor_submit_workflow', kwargs={'doc_id': workflow.id}))" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 428, "_submit_workflow_helper", "'is_oozie_mail_enabled': _is_oozie_mail_enabled(request.user)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/apps/oozie/src/oozie/views/editor2.py", 435, "_is_oozie_mail_enabled", "oozie_conf = api.get_configuration()" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/libs/liboozie/src/liboozie/oozie_api.py", 319, "get_configuration", "resp = self._root.get('admin/configuration', params)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 100, "get", "return self.invoke(\"GET\", relpath, params, headers=headers, allow_redirects=True, clear_cookies=clear_cookies)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", 80, "invoke", "clear_cookies=clear_cookies)" ], [ "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hue/desktop/core/src/desktop/lib/rest/http_client.py", 196, "execute", "raise self._exc_class(ex)" ] ] }


Thanks

HadoopHelp

Contributor

Hi,

I am having the same issue on CDP 7.1.6 with Oozie 5.1.0, but the suggested solution does not seem to work anymore.

 

Setting

                <property>
                    <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
                    <value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark/</value>
                </property>

has no effect.

 

Is there anything else I can do? Did the setting change?