ImportError: No module named pyspark from oozie job in hue

Expert Contributor

Hi, 

I am getting an error while running a Python script through a shell action in Hue/Oozie. My workflow XML is given below. Any ideas? Thanks.

from pyspark import SparkContext
ImportError: No module named pyspark
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]

 

--------------------------------------------------------------------------------

<workflow-app name="My Workflow" xmlns="uri:oozie:workflow:0.5">
  <start to="shell-8cca"/>
  <kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <action name="shell-8cca">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>oozie.launcher.mapred.child.env</name>
          <value>PYTHONPATH=/usr/bin/python</value>
        </property>
        <property>
          <name>oozie.launcher.mapred.child.env</name>
          <value>PYSPARK_PYTHON=/usr/bin/pyspark</value>
        </property>
      </configuration>
      <exec>shexample7.sh</exec>
      <env-var>PYTHONPATH=/usr/bin/python</env-var>
      <env-var>PYSPARK_PYTHON=/usr/bin/pyspark</env-var>
      <file>/user/admin/shexample7.sh#shexample7.sh</file>
      <file>/user/admin/pyexample.py#pyexample.py</file>
      <capture-output/>
    </shell>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <end name="End"/>
</workflow-app>
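
To help narrow this down, a small check script (the name check_env.py is just an example) could be shipped and run through the same shell action, to show which interpreter and module search path the launcher container actually uses:

#!/usr/bin/env python
# check_env.py (hypothetical): print the interpreter and module search path
# seen inside the Oozie launcher container, and whether pyspark is importable.
from __future__ import print_function

import os
import sys

print("interpreter: " + sys.executable)
print("PYTHONPATH : " + os.environ.get("PYTHONPATH", "<unset>"))
print("sys.path:")
for entry in sys.path:
    print("   " + entry)

try:
    import pyspark
    print("pyspark found at: " + os.path.dirname(pyspark.__file__))
except ImportError as exc:
    print("pyspark not importable: " + str(exc))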

2 REPLIES

Super Guru
Can you please share the content of shexample7.sh? I would like to see how you launch the Spark job in the shell script.

Expert Contributor

The good news is that even though the shell script didn't work, I was able to run the same Python script (which uses the Spark HiveContext) with a Spark action in Hue->Workflow instead of the Shell action.

 

The shell script is shexample7.sh:

-------------------------------------------------


#!/usr/bin/env bash

export PYTHONPATH=/usr/bin/python
export PYSPARK_PYTHON=/usr/bin/python

 

echo "starting..."

/usr/bin/spark-submit --master yarn-cluster pyexample.py
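
A side note on those exports: /usr/bin/python is the interpreter binary rather than a directory containing the pyspark package, and /usr/bin/pyspark is the pyspark shell launcher rather than a Python interpreter, so neither value puts pyspark on the module search path. One possible workaround, sketched below under the assumption of an HDP-style Spark client at /usr/hdp/current/spark-client (adjust SPARK_HOME and the py4j zip name to the actual install), is to prepend Spark's Python sources to sys.path at the top of pyexample.py before importing pyspark:

# Hypothetical prologue for pyexample.py; the default path below is only an
# assumption for an HDP-style layout and must match the real Spark install.
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/usr/hdp/current/spark-client")
# pyspark itself lives under $SPARK_HOME/python
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j ships as a zip under $SPARK_HOME/python/lib (version differs by release)
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])

from pyspark import SparkContext  # should now resolve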

 

 

The Python script is pyexample.py:

-----------------------------------------------

 

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext

 

sc = SparkContext("local", "pySpark Hive App")
# Create a Hive Context
hive_context = HiveContext(sc)

print "Reading Hive table..."


mytbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")

print "Registering DataFrame as a table..."
mytbl.show() # Show first rows of dataframe
mytbl.printSchema()

 

 

The Python job successfully displays the data, but somehow the final status comes back as KILLED, even though the script ran and returned the Hive data on stdout.
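
One possible explanation for the KILLED status (a guess, not verified here): pyexample.py hardcodes SparkContext("local", ...), which conflicts with the --master yarn-cluster passed to spark-submit, and a context that is never stopped explicitly can leave the YARN application in an unexpected final state. A minimal sketch that avoids both issues, taking the master from whatever spark-submit (or the Spark action) supplies and stopping the context at the end:

#!/usr/bin/env python
# Sketch only: the same Hive query as pyexample.py, but without hardcoding the
# master and with an explicit stop so YARN can record a clean final state.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("pySpark Hive App")
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)

mytbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")
mytbl.show()         # first rows of the DataFrame
mytbl.printSchema()

sc.stop()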