
Run a Python script containing Spark commands

Rising Star

Hello, I would like to know how I can run a Python script that contains Spark commands.

Here is the Python script that I would like to run in a Python environment:


from pyspark import SparkContext
from pyspark.sql import HiveContext

# A SparkContext is only predefined (as `sc`) inside the pyspark shell;
# a standalone script has to create it explicitly.
sc = SparkContext(appName="qvol")
hive_context = HiveContext(sc)

# Load the Hive tables as DataFrames
qvol1 = hive_context.table("table")
qvol2 = hive_context.table("table")





Rising Star

You can simply use spark-submit, which is in the bin folder of your spark-client installation. Here you can find the documentation for it:
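For instance, a minimal invocation might look like the sketch below. The script name `my_script.py` is a placeholder, and the command is echoed rather than executed here, since actually running it requires a cluster edge node with `spark-submit` on the PATH:

```shell
# A minimal sketch, assuming the script is saved locally as my_script.py
# (hypothetical name) and the spark-client bin directory is on your PATH.
# The primary resource passed to spark-submit must be a local .py file.
SCRIPT=my_script.py
CMD="spark-submit --master yarn-client --queue DES $SCRIPT"
echo "$CMD"
```

On a cluster node you would run the echoed command directly instead of echoing it.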

Rising Star

Thank you, I managed to run it, but only with a local file; when I specify the path of a file on the cluster (HDFS), I get an error:

bash-4.1$ spark-submit --master yarn-client --queue DES hdfs:///dev/datalake/app/des/dev/script/
Error: Only local python files are supported:
Parsed arguments:
  master                  yarn-client
  deployMode              client
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /usr/hdp/current/spark-client/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    /usr/hdp/current/share/lzo/0.6.0/lib/hadoop-lzo-0.6.0.jar:/usr/local/jdk-hadoop/ojdbc7.jar:/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-hadoop-compat.jar:/usr/hdp/current/hbase-client/lib/metrics-core-2.2.0.jar
  driverExtraLibraryPath  /usr/hdp/current/share/lzo/0.6.0/lib/native/Linux-amd64-64/
  driverExtraJavaOptions  null
  supervise               false
  queue                   DES
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         hdfs:///dev/datalake/app/des/dev/script/
  name
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 false

Rising Star

What is the problem with using a local file? That is in fact what you have to do; there is no reason to specify the path of the file on HDFS.
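If the script currently lives on HDFS, one approach (sketched here with a hypothetical file name, my_script.py) is to copy it to the local filesystem first with `hdfs dfs -get` and then submit the local copy:

```
bash-4.1$ hdfs dfs -get /dev/datalake/app/des/dev/script/my_script.py .
bash-4.1$ spark-submit --master yarn-client --queue DES my_script.py
```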

@alain TSAFACK

I think you need the --files option to pass the python script to all executor instances. So for example:

./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    app_arg1 app_arg2

Rising Star

Hello Paul Hargis

Here is the command I ran with the --files parameter, but it gives me an error:

bash-4.1$ spark-submit --master yarn-cluster --queue DES --files hdfs://dev/datalake/app/des/dev/script/

Error: Must specify a primary resource (JAR or Python or R file) Run with --help for usage help or --verbose for debug output

Many thanks.

I think you want to unit test this Python script. To do so, just launch the pyspark shell, which gives you a Python REPL where you can run each line one by one to test it.
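For example, a session might look like the following. The shell predefines `sc` and `sqlContext` for you (on builds with Hive support, `sqlContext` is a HiveContext, so the table lines from the question can be tested directly); the table name here is just the placeholder from the original script:

```
bash-4.1$ pyspark --master yarn-client --queue DES
>>> qvol1 = sqlContext.table("table")
>>> qvol1.show(5)
```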

Rising Star

Thank you.

But I have already done this step, and I needed to handle multiple files. This is now solved, thank you.
