Created on 10-27-2017 07:59 AM - edited 09-16-2022 05:27 AM
Hi,
I am getting an error while running a Python script with a shell action in Hue/Oozie. The error and my workflow XML are given below. Any ideas? Thanks.
from pyspark import SparkContext
ImportError: No module named pyspark
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
--------------------------------------------------------------------------------
<workflow-app name="My Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-8cca"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-8cca">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>PYTHONPATH=/usr/bin/python</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>PYSPARK_PYTHON=/usr/bin/pyspark</value>
                </property>
            </configuration>
            <exec>shexample7.sh</exec>
            <env-var>PYTHONPATH=/usr/bin/python</env-var>
            <env-var>PYSPARK_PYTHON=/usr/bin/pyspark</env-var>
            <file>/user/admin/shexample7.sh#shexample7.sh</file>
            <file>/user/admin/pyexample.py#pyexample.py</file>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
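One workaround I am considering, in case the shell action simply ignores the PYTHONPATH I set, is to add Spark's bundled Python libraries to sys.path from inside the script itself before importing pyspark. This is only a sketch; the default SPARK_HOME path and the py4j zip name are assumptions for a typical CDH install and would need to match the actual layout on the cluster nodes:

#!/usr/bin/env python
# Sketch of a workaround: prepend Spark's Python libraries to sys.path
# before importing pyspark. The SPARK_HOME default and the py4j zip
# location are assumptions and must match the actual installation.
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/opt/cloudera/parcels/CDH/lib/spark")
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j ships as a versioned zip under python/lib, so glob for it
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

from pyspark import SparkContext  # should now resolve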
Created on 10-29-2017 04:07 PM - edited 10-29-2017 04:19 PM
The good news is that even though the shell script didn't work, I was able to run the same Python script (using a Spark HiveContext) with the Spark action in Hue -> Workflow instead of the Shell action.
The shell script is shexample7.sh:
-------------------------------------------------
#!/usr/bin/env bash
export PYTHONPATH=/usr/bin/python
export PYSPARK_PYTHON=/usr/bin/python
echo "starting..."
/usr/bin/spark-submit --master yarn-cluster pyexample.py
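
If I revisit the shell-action approach, one thing I plan to check is what the launcher container's Python actually sees. A small diagnostic script along these lines (hypothetical, e.g. checkenv.py shipped as another <file> in the action) should show whether the PYTHONPATH from the action is applied at all:

#!/usr/bin/env python
# Hypothetical diagnostic: print the environment and module search path
# inside the launcher container, and test whether pyspark is importable.
import os
import sys

print "PYTHONPATH:", os.environ.get("PYTHONPATH")
print "sys.path:"
for p in sys.path:
    print "  ", p

try:
    import pyspark
    print "pyspark found at:", pyspark.__file__
except ImportError as e:
    print "pyspark not importable:", e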
The Python script is pyexample.py:
-----------------------------------------------
#!/usr/bin/env python
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext("local", "pySpark Hive App")
# Create a Hive Context
hive_context = HiveContext(sc)
print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")
print "Registering DataFrame as a table..."
mytbl.show() # Show first rows of dataframe
mytbl.printSchema()
The Python job successfully displays the data, but the final status somehow comes back as KILLED, even though the script ran and returned the Hive data in stdout.
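My guess (unconfirmed) is that this is related to the script hard-coding "local" as the master while spark-submit asks for yarn-cluster, and to the SparkContext never being stopped. A minimal variant I may try, which lets spark-submit choose the master and shuts the context down cleanly, would look roughly like this:

#!/usr/bin/env python
# Sketch of a variant that does not hard-code the master (spark-submit
# supplies it via --master) and stops the SparkContext so YARN can record
# a clean final status. Untested here; the table name is the same as above.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("pySpark Hive App")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)

try:
    mytbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")
    mytbl.show()          # print the first rows
    mytbl.printSchema()   # print the schema
finally:
    sc.stop()             # let the application finish with a clean status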