
Run a Python script containing Spark commands

Expert Contributor

Hello, I would like to know how I can run a Python script that contains Spark commands.

Here is the Python script that I would like to run in a Python environment:

#!/usr/bin/python2.7

from pyspark import SparkContext
from pyspark.sql import HiveContext

# When the script is launched with spark-submit, the SparkContext is not
# created automatically (unlike in the pyspark shell), so create it here.
sc = SparkContext()
hive_context = HiveContext(sc)

# Load the Hive tables (placeholder table names).
qvol1 = hive_context.table("table")
qvol2 = hive_context.table("table")

# Register them as temporary tables so they can be referenced in SQL.
qvol1.registerTempTable("qvol1_temp")
qvol2.registerTempTable("qvol2_temp")

# Run the query (placeholder) and display the result.
df = hive_context.sql("request")
df.show()

7 REPLIES

Expert Contributor

You can simply use spark-submit, which is in the bin folder of your spark-client installation. Here you can find the documentation for it: http://spark.apache.org/docs/latest/submitting-applications.html
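
For example (a sketch; the queue name and script name are taken from later posts in this thread, and the local path is hypothetical), submitting a local copy of the script to YARN could look like this:

spark-submit --master yarn-client --queue DES /path/to/return.py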

Expert Contributor

Thank you, I managed to run it. However, my file is local, and when I specify the path of a file on the cluster I get an error:

bash-4.1$ spark-submit --master yarn-client --queue DES hdfs:///dev/datalake/app/des/dev/script/return.py
Error: Only local python files are supported:
Parsed arguments:
  master                  yarn-client
  deployMode              client
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /usr/hdp/current/spark-client/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    /usr/hdp/current/share/lzo/0.6.0/lib/hadoop-lzo-0.6.0.jar:/usr/local/jdk-hadoop/ojdbc7.jar:/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-hadoop-compat.jar:/usr/hdp/current/hbase-client/lib/metrics-core-2.2.0.jar
  driverExtraLibraryPath  /usr/hdp/current/share/lzo/0.6.0/lib/native/Linux-amd64-64/
  driverExtraJavaOptions  null
  supervise               false
  queue                   DES
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         hdfs:///dev/datalake/app/des/dev/script/return.py
  name                    return.py
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 false

Expert Contributor

What is the problem with using a local file? That is indeed what you have to do... There is no reason to specify the path of the file on HDFS.
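
For instance (a sketch, assuming the script lives at the HDFS path from your previous post and that a local copy in /tmp is acceptable), you could pull it down and submit the local copy:

hdfs dfs -get /dev/datalake/app/des/dev/script/return.py /tmp/return.py
spark-submit --master yarn-client --queue DES /tmp/return.py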


@alain TSAFACK

I think you need the --files option to pass the Python script to all executor instances. So for example:

./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    --files return.py \
    my-main-jar.jar \
    app_arg1 app_arg2

Expert Contributor

Hello Paul Hargis

Here is the command that I run with the --files parameter, but it gives me an error:

bash-4.1$ spark-submit --master yarn-cluster --queue DES --files hdfs://dev/datalake/app/des/dev/script/return.py

Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output

My cordial thanks.

Super Guru

I think you want to unit test this Python script. To do so, just launch the pyspark shell, which will give you a Python REPL where you can run each line one by one to test it.
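
For example (a sketch; the queue name is carried over from earlier posts, and the table name and query are still the placeholders from the original script), you could start the shell and paste the script's statements one at a time:

pyspark --master yarn-client --queue DES

>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)  # sc already exists in the pyspark shell
>>> qvol1 = hive_context.table("table")
>>> qvol1.registerTempTable("qvol1_temp")
>>> hive_context.sql("request").show()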

Expert Contributor

Thank you.

But I've already done this step, and I needed to handle multiple files. This is now solved, thank you.
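
For the multi-file case mentioned above, one common pattern (a sketch; the helper module names are hypothetical) is to pass the main script as the primary resource and ship the additional Python modules with --py-files:

spark-submit --master yarn-client --queue DES \
    --py-files helpers.py,utils.py \
    return.py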