Contributor
Posts: 44
Registered: ‎09-14-2017

ImportError: No module named pyspark from oozie job in hue

Hi, 

I am getting an error when running a Python script through a shell action in Hue/Oozie. My workflow XML is given below. Any ideas? Thanks.

 

from pyspark import SparkContext
ImportError: No module named pyspark
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]

 

--------------------------------------------------------------------------------

<workflow-app name="My Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-8cca"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-8cca">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>PYTHONPATH=/usr/bin/python</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>PYSPARK_PYTHON=/usr/bin/pyspark</value>
                </property>
            </configuration>
            <exec>shexample7.sh</exec>
            <env-var>PYTHONPATH=/usr/bin/python</env-var>
            <env-var>PYSPARK_PYTHON=/usr/bin/pyspark</env-var>
            <file>/user/admin/shexample7.sh#shexample7.sh</file>
            <file>/user/admin/pyexample.py#pyexample.py</file>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

Cloudera Employee
Posts: 212
Registered: ‎03-23-2015

Re: ImportError: No module named pyspark from oozie job in hue

Can you please share the content of shexample7.sh? I would like to see how you launch the Spark job in the shell script.
Contributor
Posts: 44
Registered: ‎09-14-2017

Re: ImportError: No module named pyspark from oozie job in hue

The good news is that even though the shell script didn't work, I was able to run the same Python script, which uses a Spark HiveContext, through the Spark action in Hue -> Workflow instead of the Shell action.

 

The shell script is shexample7.sh:

-------------------------------------------------


#!/usr/bin/env bash

export PYTHONPATH=/usr/bin/python
export PYSPARK_PYTHON=/usr/bin/python

 

echo "starting..."

/usr/bin/spark-submit --master yarn-cluster pyexample.py
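
A note on the exports above: PYTHONPATH is a search path for module directories, so pointing it at the interpreter binary /usr/bin/python does not by itself make pyspark importable. Whether this is the cause of the ImportError is not confirmed in this thread. As a minimal sketch, assuming a Spark installation that keeps the pyspark package under python/ and a bundled py4j source zip under python/lib/, the entries could be built like this. The helper name pyspark_path_entries and the /path/to/spark placeholder are hypothetical, not taken from this thread:

```python
import glob
import os


def pyspark_path_entries(spark_home):
    """Return the PYTHONPATH entries needed to import pyspark from spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # Spark ships py4j as a source zip; the version in the file name varies,
    # so it is matched with a wildcard instead of being hard-coded.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


# Assumption: SPARK_HOME is set on the node that runs the shell action.
# The fallback is only a placeholder and must be replaced with a real path.
spark_home = os.environ.get("SPARK_HOME", "/path/to/spark")
pythonpath_value = os.pathsep.join(pyspark_path_entries(spark_home))
```

The resulting pythonpath_value is what would go into the PYTHONPATH export, while PYSPARK_PYTHON keeps pointing at the interpreter.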

 

 

The python script is pyexample.py:

-----------------------------------------------

 

#!/usr/bin/env python

from pyspark import SparkContext
from pyspark.sql import HiveContext

 

sc = SparkContext("local", "pySpark Hive App")
# Create a Hive Context
hive_context = HiveContext(sc)

print "Reading Hive table..."


mytbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")

print "Registering DataFrame as a table..."
mytbl.show() # Show first rows of dataframe
mytbl.printSchema()

 

 

The Python job displays the data successfully, but the final status still comes back as KILLED, even though the script ran and returned data from Hive in stdout.
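
One possible cause, not confirmed in this thread: pyexample.py hard-codes the master as "local" in SparkContext("local", ...), while shexample7.sh submits with --master yarn-cluster. A master set in code takes precedence over the spark-submit flag, so the job may run locally inside the YARN container and never report a clean result to YARN. A minimal sketch of building the context without a hard-coded master is below. The helper name create_hive_context is hypothetical:

```python
def create_hive_context(app_name="pySpark Hive App"):
    """Build a SparkContext and HiveContext without hard-coding the master."""
    # Imported inside the function so this module can be loaded on machines
    # where pyspark is not on the Python path.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    # No setMaster() call here: the master is taken from spark-submit
    # (for example --master yarn-cluster) or from spark-defaults.conf.
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    return sc, HiveContext(sc)
```

In the script, sc, hive_context = create_hive_context() would replace the two constructor calls, and calling sc.stop() at the end lets the application shut down explicitly.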
