Created 06-27-2016 07:16 AM
Hello, what I want to know is how I can run a Python script that contains Spark commands.
Here is the Python script that I would like to run in a Python environment:
#!/usr/bin/python2.7
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create the SparkContext explicitly; unlike the pyspark shell,
# a script run with spark-submit does not get "sc" for free.
sc = SparkContext(appName="return")
hive_context = HiveContext(sc)

# Load the Hive tables and register them as temporary tables
qvol1 = hive_context.table("table")
qvol2 = hive_context.table("table")
qvol1.registerTempTable("qvol1_temp")
qvol2.registerTempTable("qvol2_temp")

# Run the query and display the result
df = hive_context.sql("request")
df.show()
Created 06-27-2016 07:18 AM
You can simply use spark-submit, which is in the bin folder of your spark-client installation. Here you can find the documentation for it: http://spark.apache.org/docs/latest/submitting-applications.html
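For example, assuming your script is saved locally as return.py (the path, master and queue below are placeholders to adapt to your own setup), something like this should work:

spark-submit --master yarn-client --queue DES /path/to/return.py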
Created 06-27-2016 02:04 PM
Thank you, I managed to run it. It works when my file is local, but when I specify the path of a file on the cluster, I get an error:
bash-4.1$ spark-submit --master yarn-client --queue DES hdfs:///dev/datalake/app/des/dev/script/return.py
Error: Only local python files are supported:
Parsed arguments:
  master                  yarn-client
  deployMode              client
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /usr/hdp/current/spark-client/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    /usr/hdp/current/share/lzo/0.6.0/lib/hadoop-lzo-0.6.0.jar:/usr/local/jdk-hadoop/ojdbc7.jar:/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar:/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-hadoop-compat.jar:/usr/hdp/current/hbase-client/lib/metrics-core-2.2.0.jar
  driverExtraLibraryPath  /usr/hdp/current/share/lzo/0.6.0/lib/native/Linux-amd64-64/
  driverExtraJavaOptions  null
  supervise               false
  queue                   DES
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         hdfs:///dev/datalake/app/des/dev/script/return.py
  name                    return.py
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 false
Created 06-28-2016 07:11 AM
What is the problem with using a local file? That is indeed what you have to do; there is no need to specify an HDFS path for the script.
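For example, a rough sketch (assuming the script currently lives at the HDFS path shown in your error message and you can write to the local working directory): copy it down with hdfs dfs -get and submit the local copy.

hdfs dfs -get /dev/datalake/app/des/dev/script/return.py .
spark-submit --master yarn-client --queue DES return.py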
Created 06-27-2016 04:51 PM
I think you need the --files option to pass the Python script to all executor instances. For example:
./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    --files return.py \
    my-main-jar.jar app_arg1 app_arg2
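For a pure Python job, a comparable sketch (assuming return.py is on the local filesystem of the machine you submit from; helpers.py here is only a hypothetical extra module to illustrate shipping additional files) might look like:

./bin/spark-submit --master yarn-cluster \
    --queue DES \
    --py-files helpers.py \
    return.py app_arg1 app_arg2

Note that spark-submit expects the primary resource (the JAR or .py file) after all the options; --files and --py-files only ship additional files alongside it.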
Created 06-28-2016 06:42 AM
Hello Paul Hargis,
Here is the command I ran with the --files parameter, but it gives me an error:
bash-4.1$ spark-submit --master yarn-cluster --queue DES --files hdfs://dev/datalake/app/des/dev/script/return.py
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output
Many thanks.
Created 06-27-2016 07:36 AM
I think you want to unit test this Python script. To do so, just launch the pyspark shell, which gives you a Python REPL where you can run each line one by one to test it.
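For example (assuming an HDP-style installation where pyspark is on the PATH; adjust the master and queue to your environment):

pyspark --master yarn-client --queue DES

The shell already creates sc (and typically a SQL context) for you, so you can paste the script's lines in one at a time.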
Created 06-27-2016 12:11 PM
Thank you.
But I had already done that step, and I needed to handle multiple files. This is now solved, thank you.