03-12-2018 06:13 PM
Hi, I've tried your article with a simpler example on HDP 2.4.x. Instead of NLTK, I created a simple conda environment called jup (similar to https://www.anaconda.com/blog/developer-blog/conda-spark/).

When I try to run a variant of your spark-submit command with NLTK, I get "path ./ANACONDA/jup does not exist". Where did you define NLTK in your example PYSPARK_PYTHON=./ANACONDA/jup/bin/python ...?

I looked at the logs and it does not appear to be unzipping the zip file. I've added all the paths that I can get hold of, as follows. Please note that if I drop the ./ANACONDA prefix and run Spark locally, then it works.

PYSPARK_PYTHON=./ANACONDA/jup/bin/python spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/jup/bin/python \
  --conf spark.yarn.executorEnv.PYSPARK_PYTHON=./ANACONDA/jup/bin/python \
  --conf spark.yarn.appMasterEnv.PYTHONPATH="/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip" \
  --conf spark.executorEnv.PYTHONPATH="/usr/hdp/current/spark-client/python/:/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip" \
  --conf spark.yarn.appMasterEnv.PYTHONSTARTUP="/usr/hdp/current/spark-client/python/pyspark/shell.py" \
  --conf spark.yarn.appMasterEnv.SPARK_HOME="/usr/hdp/current/spark-client" \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./ANACONDA/jup/bin/python \
  --master yarn \
  --archives /opt/app/anaconda3/envs/jup.zip#ANACONDA \
  /home/d3849648/DSF/pysubmit2.py

On YARN I can only get it to run if I drop ANACONDA from the PySpark path, but I still get the following error. The upload side:

18/03/12 16:46:34 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://HDP50/hdp/apps/2.4.2.0-258/spark/spark-hdp-assembly.jar
18/03/12 16:46:34 INFO Client: Source and destination file systems are the same. Not copying hdfs://HDP50/hdp/apps/2.4.2.0-258/spark/spark-hdp-assembly.jar
18/03/12 16:46:34 INFO Client: Uploading resource file:/opt/app/anaconda3/envs/jup.zip#ANACONDA -> hdfs://HDP50/user/d3849648/.sparkStaging/application_1520011290259_0032/jup.zip
18/03/12 16:46:39 INFO Client: Uploading resource file:/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip -> hdfs://HDP50/user/XX/.sparkStaging/application_1520011290259_0032/pyspark.zip
18/03/12 16:46:39 INFO Client: Uploading resource file:/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip -> hdfs://HDP50/user/XX/.sparkStaging/application_1520011290259_0032/py4j-0.9-src.zip
18/03/12 16:46:39 INFO Client: Uploading resource file:/tmp/spark-3763c756-db41-4cd7-8cbe-7e48a1788e7d/__spark_conf__4305293861537090121.zip -> hdfs://HDP50/user/XX/.sparkStaging/application_1520011290259_0032/__spark_conf__4305293861537090121.zip
18/03/12 16:47:03 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hkg3pl0244.hk.hsbc): java.io.IOException: Cannot run program "jup/bin/python": error=2, No such file or directory
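For reference, this is roughly how I built and packaged the environment, following the linked conda-spark post (the exact conda create and zip commands below are a sketch from memory, and the package list isn't important here). I mention it because whether ./ANACONDA/jup/bin/python can exist in the container at all depends on whether the zip keeps jup/ as its top-level directory when YARN extracts the archive under the ANACONDA alias:

# create a simple test environment (no NLTK; exact packages are a guess)
conda create -y -n jup python=2.7 numpy

# zip it from the envs directory so "jup/" stays the top-level folder;
# YARN extracts the archive under the ANACONDA alias, so the interpreter
# should then appear at ./ANACONDA/jup/bin/python in the container
cd /opt/app/anaconda3/envs
zip -r jup.zip jup

# sanity check on the layout: entries should start with "jup/"
unzip -l jup.zip | head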
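And for comparison, this is roughly the local-mode run that does work, when PYSPARK_PYTHON points straight at the unpacked environment on the edge node instead of the ./ANACONDA alias (reconstructed from memory; paths are the same as in the command above):

# works locally: no dependence on YARN extracting the archive, since the
# interpreter path is an absolute path on the local filesystem
PYSPARK_PYTHON=/opt/app/anaconda3/envs/jup/bin/python \
  spark-submit --master local[2] /home/d3849648/DSF/pysubmit2.py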