Created on 03-28-2017 03:22 PM - edited 09-16-2022 04:21 AM
I have an intermittent issue. I've read the other threads about the "numpy not found" error, both on this site and elsewhere on the web, but the problem keeps coming back after I re-deploy client configurations.
I am running a Spark job through HUE->Oozie, using PySpark's MLlib, which requires numpy.
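For context, here is the shape of the job that hits the error; a minimal sketch, assuming the Spark 1.x MLlib API that ships with CDH (the app name and training data are illustrative, not my actual job):

# Minimal PySpark job that exercises MLlib. LabeledPoint wraps numpy
# arrays, so "ImportError: No module named numpy" surfaces on the
# executors as soon as they deserialize the training data.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="numpy-repro")
points = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])
model = LogisticRegressionWithSGD.train(points, iterations=10)
print(model.weights)
sc.stop()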
Initially, I followed the Cloudera docs and blog posts, which say to install numpy on each node (Anaconda isn't an option for me). I installed numpy on each node using yum as root (I didn't create a virtual environment for this). This worked. However, I later re-deployed the client configurations through CM for reasons unrelated to this issue, and the "numpy not found" error came back.
At this point I went to the configuration page for Spark in CM to set the variables:
PYSPARK_PYTHON=/usr/lib64/python2.7
PYSPARK_DRIVER_PYTHON=/usr/lib64/python2.7
in Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh.
Next, I re-deployed the client configurations, and it started working again. But after yet another re-deployment, again for reasons unrelated to this issue, the "numpy not found" error returned.
So the fix only seems to last for a single deployment. I also checked the permissions on the Python paths and don't see any issues there, but I may be missing something.
Could this be related to running it through HUE or Oozie?
Are the environment variables I set pointing to the correct paths?
Any help is appreciated. Thanks!
Created 03-28-2017 10:28 PM
I think the recommended way to manage this without installing Anaconda yourself is to use the Anaconda-based parcel for CDH, which lays down a basic set of dependencies like numpy and should plumb through the necessary configuration to use them.
Created on 03-30-2017 11:14 AM - edited 03-30-2017 11:16 AM
Unfortunately, Anaconda isn't an option for me.
I also added "export" to my safety valve changes for the two Python variables, but numpy still cannot be found.
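In other words, the safety valve now contains these two lines (same paths as before, just exported):

export PYSPARK_PYTHON=/usr/lib64/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/lib64/python2.7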
Created on 04-19-2017 10:41 AM - edited 04-19-2017 10:42 AM
In case anyone else has this issue, the documentation for CDH 5.10 is incorrect.
It says to set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh. I imagine this would be correct if you run Spark in standalone mode.
However, if you run in yarn-client or yarn-cluster mode, PYSPARK_PYTHON has to be set on the YARN service instead. The driver variable isn't needed there; it appears to be relevant only if you run Spark through a notebook. I also didn't have to do any of the extra steps the docs list for yarn-cluster mode.
In YARN (MR2 Included) Service Environment Advanced Configuration Snippet (Safety Valve), set:
PYSPARK_PYTHON="/usr/bin/python"
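A quick way to verify which interpreter the executors actually pick up after a config change; a minimal sketch, assuming you submit it through YARN the same way as the real job:

# Prints the driver's interpreter, then asks one executor for its
# interpreter and numpy version. If PYSPARK_PYTHON points somewhere
# without numpy, probe() fails with the familiar numpy error.
import sys
from pyspark import SparkContext

sc = SparkContext(appName="env-check")
print("driver python: %s" % sys.executable)

def probe(_):
    import sys, numpy
    return "executor python: %s, numpy %s" % (sys.executable, numpy.__version__)

print(sc.parallelize([0], 1).map(probe).first())
sc.stop()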