I have seen other topics with the same or a similar subject, in particular this one. I followed the hints there, but they either do not solve my problem or it is unclear how to implement them, so I am opening a separate topic.
In a CDH 6.3.2 cluster I have an Anaconda parcel distributed and activated, which of course includes the numpy module. However, the Spark nodes seem to ignore the CDH configuration and keep using the system-wide Python at /usr/bin/python.
As a workaround I installed numpy into the system-wide Python on all cluster nodes, yet I still get "ImportError: No module named numpy".
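To narrow down which interpreter the executors actually run, a check along the following lines might help. It is only a sketch: `probe` and `probe_executors` are hypothetical helper names, and `sc` is assumed to be the SparkContext that already exists in the notebook. Each task reports its host name, its `sys.executable` and the numpy version it can import (or `None`).

```python
from __future__ import print_function  # keeps print() usable on Python 2 as well

import socket
import sys


def probe(_):
    """Return (host, interpreter path, numpy version or None) for this process."""
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None
    return (socket.gethostname(), sys.executable, numpy_version)


def probe_executors(sc, num_tasks=20):
    """Run probe() in num_tasks Spark tasks and return the distinct answers.

    `sc` is assumed to be an existing SparkContext. More tasks make it more
    likely, but do not guarantee, that every executor is sampled.
    """
    rows = sc.parallelize(range(num_tasks), num_tasks).map(probe).collect()
    return sorted(set(rows), key=str)
```

In the notebook, `print(probe(None))` shows what the driver uses and `for row in probe_executors(sc): print(row)` shows the executors. If the executor rows list /usr/bin/python with `None` as the numpy version, the workers are not using the interpreter that has numpy installed.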
I would appreciate any further advice on how to solve the problem.
Here is the error extracted from a Jupyter notebook output:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 1.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, blc-worker-03.novalocal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/worker.py", line 359, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/worker.py", line 64, in read_command
command = serializer._read_with_length(file)
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/serializers.py", line 172, in _read_with_length
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/serializers.py", line 580, in loads
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/mllib/__init__.py", line 28, in <module>
ImportError: No module named numpy
Actually, on a clean CentOS 7.6 a simple pip install numpy does not work: the command fails with "RuntimeError: Python version >=3.6 required". I had to upgrade pip first, change the default permission mask (when numpy is installed system-wide by root, the package is otherwise not readable by non-root users), and only then install numpy.
Nonetheless, this workaround is not scalable (it should be managed cluster-wide from Cloudera Manager, not from the command line), and it goes against Python/pip best practices (pip should not be used for system-wide installations as root). Hence I am still looking for a way to make the PySpark script use the Anaconda Python on the cluster nodes.
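One approach that may be worth trying from the notebook side is to set PYSPARK_PYTHON in the driver process before the SparkContext is created, since PySpark reads that variable at context creation to decide which executable the workers launch. The sketch below rests on assumptions: the parcel path /opt/cloudera/parcels/Anaconda/bin/python is the usual default location but should be verified on the nodes, and `use_python_for_workers` is a hypothetical helper name.

```python
import os

# Assumption: default location of the Anaconda parcel's interpreter.
# Verify that this path exists on every worker node before relying on it.
ANACONDA_PYTHON = "/opt/cloudera/parcels/Anaconda/bin/python"


def use_python_for_workers(python_path=ANACONDA_PYTHON, environ=os.environ):
    """Point PySpark workers at python_path via PYSPARK_PYTHON.

    This has to run before the SparkContext is created; a context that
    already exists keeps the interpreter it was started with.
    """
    environ["PYSPARK_PYTHON"] = python_path
    return environ["PYSPARK_PYTHON"]


# Intended use in a fresh notebook kernel (not executed here):
#   use_python_for_workers()
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("numpy-check").getOrCreate()
```

If a SparkContext is already running in the kernel, it would need to be stopped and recreated for the setting to take effect. PySpark also expects the driver and the workers to run the same Python minor version, so the notebook kernel itself would probably have to use the Anaconda interpreter as well.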