Created 06-27-2016 07:16 AM
Hello, what I want to know is how I can run a Python script that contains Spark commands.
Here is the Python script that I would like to run in a Python environment:
#!/usr/bin/python2.7
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create the SparkContext explicitly; unlike the pyspark shell,
# a script run with spark-submit does not get "sc" for free.
sc = SparkContext(appName="return")
hive_context = HiveContext(sc)

# Load the Hive tables and register them as temporary tables
qvol1 = hive_context.table("table")
qvol2 = hive_context.table("table")
qvol1.registerTempTable("qvol1_temp")
qvol2.registerTempTable("qvol2_temp")

# Run the query and display the result
df = hive_context.sql("request")
df.show()
Created 06-27-2016 07:18 AM
You can simply use spark-submit, which is in the bin folder of your spark-client installation. Here you can find the documentation for it: http://spark.apache.org/docs/latest/submitting-applications.html
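For example, assuming your script is saved locally as return.py (the path, master and queue below are placeholders to adapt to your own setup), something like this should work:

spark-submit --master yarn-client --queue DES /path/to/return.py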
Created 06-27-2016 02:04 PM
Thank you, I managed to run it. It works when my file is local, but when I specify the path of a file on the cluster, I get an error:
bash-4.1$ spark-submit --master yarn-client --queue DES hdfs:///dev/datalake/app/des/dev/script/return.py
Error: Only local python files are supported:
Parsed arguments:
  master                  yarn-client
  deployMode              client
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /usr/hdp/current/spark-client/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    /usr/hdp/current/share/lzo/0.6.0/lib/hadoop-lzo-0.6.0.jar:/usr/local/jdk-hadoop/ojdbc7.jar:/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-hadoop-compat.jar:/usr/hdp/current/hbase-client/lib/metrics-core-2.2.0.jar
  driverExtraLibraryPath  /usr/hdp/current/share/lzo/0.6.0/lib/native/Linux-amd64-64/
  driverExtraJavaOptions  null
  supervise               false
  queue                   DES
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         hdfs:///dev/datalake/app/des/dev/script/return.py
  name                    return.py
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 false
Created 06-28-2016 07:11 AM
What is the problem with using a local file? That is indeed what you have to do; there is no need to specify an HDFS path for the script.
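For example, a rough sketch (assuming the script currently lives at the HDFS path shown in your error message and you can write to the local working directory): copy it down with hdfs dfs -get and submit the local copy.

hdfs dfs -get /dev/datalake/app/des/dev/script/return.py .
spark-submit --master yarn-client --queue DES return.py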
Created 06-27-2016 04:51 PM
I think you need the --files option to pass the Python script to all executor instances. For example:
./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    --files return.py \
    my-main-jar.jar app_arg1 app_arg2
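For a pure Python job, a comparable sketch (assuming return.py is on the local filesystem of the machine you submit from; helpers.py here is only a hypothetical extra module to illustrate shipping additional files) might look like:

./bin/spark-submit --master yarn-cluster \
    --queue DES \
    --py-files helpers.py \
    return.py app_arg1 app_arg2

Note that spark-submit expects the primary resource (the JAR or .py file) after all the options; --files and --py-files only ship additional files alongside it.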
Created 06-28-2016 06:42 AM
Hello Paul Hargis,
Here is the command I ran with the --files parameter, but it gives me an error:
bash-4.1$ spark-submit --master yarn-cluster --queue DES --files hdfs://dev/datalake/app/des/dev/script/return.py
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
Many thanks.
Created 06-27-2016 07:36 AM
I think you want to unit test this Python script. To do so, just launch the pyspark shell, which gives you a Python REPL where you can run each line one by one to test it.
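For example (assuming an HDP-style installation where pyspark is on the PATH; adjust the master and queue to your environment):

pyspark --master yarn-client --queue DES

The shell already creates sc (and typically a SQL context) for you, so you can paste the script's lines in one at a time.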
Created 06-27-2016 12:11 PM
Thank you.
But I had already done that step, and I needed to handle multiple files. This is now solved, thank you.