Support Questions

Marek · ‎07-31-2020

I have seen other topics with the same or a similar subject, in particular this one. I followed the hints there, but either they do not solve my problem or it is unclear how to implement them. Hence this separate topic.

In a CDH 6.3.2 cluster I have an Anaconda parcel distributed and activated, which of course has the numpy module installed. However, the Spark nodes seem to ignore the CDH configuration and keep using the system-wide Python from /usr/bin/python.

As a workaround I also installed numpy into the system-wide Python on all cluster nodes. However, I still get "ImportError: No module named numpy".

I would appreciate any further advice on how to solve the problem.

I am also not sure how to implement the solution referred to in https://stackoverflow.com/questions/46857090/adding-pyspark-python-path-in-oozie. Any clarification is much appreciated.

Here is the error extracted from a Jupyter notebook output:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Aborting TaskSet 1.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, blc-worker-03.novalocal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/worker.py", line 359, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/worker.py", line 64, in read_command
    command = serializer._read_with_length(file)
  File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/serializers.py", line 580, in loads
    return pickle.loads(obj)
  File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/mllib/__init__.py", line 28, in <module>
    import numpy
ImportError: No module named numpy
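The traceback above shows the import failing inside the Python worker on an executor, not on the driver, so it can help to check which interpreter each side actually runs. The following is a minimal sketch, not a definitive diagnostic: `probe_python` is a hypothetical helper name, and the commented-out call assumes an existing SparkContext named `sc`. It avoids Python 3 only syntax because the workers here may be running Python 2.

```python
import socket
import sys


def probe_python(_partition=None):
    """Yield (hostname, interpreter path, numpy version or None) for the current process."""
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None
    yield (socket.gethostname(), sys.executable, numpy_version)


# On the driver (for example, in the notebook kernel):
print(next(probe_python()))

# On the executors; assumes an existing SparkContext named `sc`:
# print(sc.parallelize(range(8), 8).mapPartitions(probe_python).distinct().collect())
```

If the executor rows report /usr/bin/python with no numpy version, the workers are not picking up the Anaconda parcel. Note that if the driver and the workers run different Python versions, PySpark may raise a version mismatch error before the probe returns anything.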

Scharan · ‎08-01-2020

@Marek You need to install the numpy package on all the NodeManager hosts. Use the command below to install numpy, then rerun the code.

# pip install numpy

GangWar · ‎08-01-2020

@Marek The solution you are referring to can be implemented as shown below in the Oozie service configuration:

<property>
<name>oozie.launcher.mapreduce.map.env</name>
<value>PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python,PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python</value>
</property>

That said, installing the numpy package should also resolve the issue.

Cheers!

Marek · ‎08-03-2020

@GangWar In which Oozie service configuration item in Cloudera Manager should this be defined?

Marek · ‎08-03-2020

Actually, on a clean CentOS 7.6 a simple pip install numpy does not work: the command fails with RuntimeError: Python version >= 3.6 required. I had to upgrade pip first, then change the default permission mask (when root installs system wide, the installed numpy package is otherwise not readable by non-root users), and only then install numpy:

# pip install --upgrade pip
Collecting pip
[...]
# umask 022; pip install numpy
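The permission-mask step matters because Spark tasks run as a non-root user, which must be able to read the installed package. A rough way to check an existing installation is sketched below; `world_readable` and `unreadable_entries` are hypothetical helper names, and the check only inspects the "others" permission bits (it ignores ACLs and group membership).

```python
import os
import stat


def world_readable(path):
    """Return True if 'others' may read path (and traverse it, if it is a directory)."""
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        needed = stat.S_IROTH | stat.S_IXOTH
    else:
        needed = stat.S_IROTH
    return (mode & needed) == needed


def unreadable_entries(root):
    """List every file or directory under root that lacks read access for 'others'."""
    blocked = [] if world_readable(root) else [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            try:
                if not world_readable(full):
                    blocked.append(full)
            except OSError:
                # Broken symlinks or entries that vanished are reported as blocked.
                blocked.append(full)
    return blocked


# Example, run with the same interpreter that the Spark workers use:
# import numpy
# print(unreadable_entries(os.path.dirname(numpy.__file__)))
```

An empty list suggests that permissions are not the cause; any listed path would likely need its mode fixed or the package reinstalled under umask 022.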

Nonetheless, this workaround is not scalable (it should be managed cluster-wide from Cloudera Manager, not from the command line), and it goes against Python/pip best practices (pip should not be used for system-wide installations as root). Hence I am still looking for a way to make the PySpark script use the Anaconda Python on the cluster nodes.
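One per-session approach that may work from a notebook is to point PySpark at the Anaconda interpreter before the Spark context is created. This is a sketch under assumptions, not a confirmed fix for this cluster: the interpreter path is the one quoted in the Oozie property earlier in the thread, `pyspark_python_conf` is a hypothetical helper, and the notebook kernel itself is assumed to run a matching Python version. `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` are standard Spark properties for passing environment variables to the YARN application master and to the executors.

```python
import os

# Interpreter path taken from the Oozie property quoted earlier in this thread.
ANACONDA_PYTHON = "/opt/cloudera/parcels/Anaconda/bin/python"


def pyspark_python_conf(python_path):
    """Build Spark settings that export PYSPARK_PYTHON to the YARN AM and the executors."""
    return {
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }


# Must be set before the SparkContext is created; an already running
# context keeps the interpreter it was started with.
os.environ["PYSPARK_PYTHON"] = ANACONDA_PYTHON

conf_items = pyspark_python_conf(ANACONDA_PYTHON)

# In the notebook (requires pyspark; commented out so the sketch runs without a cluster):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("numpy-check")
# for key, value in conf_items.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

This only affects the session that sets it, so it does not replace a cluster-wide setting managed through Cloudera Manager.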

Cloudera Community

Jupyter notebook > ImportError: No module named numpy