pyspark in oozie SparkAction

New Contributor

I am not able to successfully run an Oozie SparkAction workflow with Python code. I always receive the following:

>>> Invoking Spark class now >>>

Traceback (most recent call last):
  File "/data/2/yarn/nm/usercache/jozin/appcache/application_1462890728975_0964/container_1462890728975_0964_01_000001/spark_easy.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark
Intercepting System.exit(1)

Cluster:

CDH 5.7.0

Oozie 4.1.0

Spark  1.6.0

 

Workflow settings:

oozie.use.system.libpath: true

oozie.libpath:                    /user/jozin/spark_easy.py

 

Spark job:

Spark master: yarn-client

Mode: client

Appname: spark_test

Jars/py files: spark_easy.py

 

spark_easy.py:

from pyspark import SparkContext
sc = SparkContext()

 

Has anybody resolved this yet? Thanks

 

1 REPLY

Re: pyspark in oozie SparkAction

Rising Star

Can you please share your "workflow.xml" and "job.properties" files?
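For reference, a minimal SparkAction workflow for a PySpark script usually looks roughly like the sketch below. The app name, node names, and paths here are placeholders, not taken from your setup; the corresponding job.properties would typically define `nameNode` and `jobTracker`, set `oozie.use.system.libpath=true`, and point `oozie.wf.application.path` at the HDFS directory containing this file and your script:

```xml
<workflow-app name="spark_test_wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-client</master>
            <mode>client</mode>
            <name>spark_test</name>
            <!-- For PySpark, the <jar> element points at the .py script -->
            <jar>spark_easy.py</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```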

 

Also, can you adapt your "spark_easy.py" as follows and give it a try?

 

# SparkConf must be imported before it is used below
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Build the SparkContext from an explicit configuration
sconf = SparkConf().setAppName("SparkEasy").set("spark.driver.memory", "1g")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)

# Run a simple query against an existing Hive table to verify the setup
simple_DF = sqlCtx.sql("select * from <WHATEVER_EXISTING_TABLE_HERE>")

HTH
