Jupyter notebook: ImportError: No module named numpy
Created on 07-31-2020 07:14 AM - edited 09-16-2022 07:38 AM
I have seen other topics with the same or a similar subject, in particular this one. I followed the hints, but they do not solve my problem, or it is unclear how to implement a solution. Hence this alternate topic.
In a CDH 6.3.2 cluster I have an Anaconda parcel distributed and activated, which of course has the numpy module installed. However, the Spark nodes seem to ignore the CDH configuration and keep using the system-wide Python from /usr/bin/python.
Nevertheless, I have installed numpy in the system-wide Python across all cluster nodes, yet I still get the "ImportError: No module named numpy".
I would appreciate any further advice on how to solve the problem.
I am also not sure how to implement the solution referred to in https://stackoverflow.com/questions/46857090/adding-pyspark-python-path-in-oozie. Any clarification is much appreciated.
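For reference, this is how I check which Python the executors actually pick up (a minimal sketch, assuming a live SparkContext named sc, as created by the notebook kernel; the partition count is just illustrative):
import sys
# Interpreter used by the driver:
print("Driver Python: %s" % sys.executable)
# Each task reports the interpreter it runs under; distinct() collapses duplicates.
exec_pythons = (sc.parallelize(range(4), 4)
                  .map(lambda _: __import__("sys").executable)
                  .distinct()
                  .collect())
print("Executor Python(s): %s" % exec_pythons)
In my case the executors report /usr/bin/python rather than the Anaconda parcel's interpreter, consistent with the behaviour described above.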
Here is the error extracted from a Jupyter notebook output:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Aborting TaskSet 1.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, blc-worker-03.novalocal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/worker.py", line 359, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/worker.py", line 64, in read_command
command = serializer._read_with_length(file)
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/serializers.py", line 580, in loads
return pickle.loads(obj)
File "/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/spark/python/pyspark/mllib/__init__.py", line 28, in <module>
import numpy
ImportError: No module named numpy
Created on 08-01-2020 08:06 AM - edited 08-01-2020 08:07 AM
@Marek You need to install the numpy package on all the NodeManager hosts. Use the command below to install numpy, then rerun the code.
# pip install numpy
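Once installed, you can verify that every executor can import numpy (a minimal sketch, assuming a live SparkContext sc; the partition count of 8 is only there to touch several hosts):
def numpy_version(_):
    # Runs inside an executor: report the host and its numpy version, if any.
    import socket
    try:
        import numpy
        return (socket.gethostname(), numpy.__version__)
    except ImportError:
        return (socket.gethostname(), None)
print(sc.parallelize(range(8), 8).map(numpy_version).distinct().collect())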
Created 08-01-2020 09:07 AM
@Marek The solution you are referring to can be implemented as below in the Oozie service configuration:
<property>
  <name>oozie.launcher.mapreduce.map.env</name>
  <value>PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python,PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python</value>
</property>
That said, installing the numpy package on all nodes should resolve the issue.
Cheers!
Created 08-03-2020 02:21 AM
@GangWar In which Oozie service configuration item in Cloudera Manager should this be defined?
Created on 08-03-2020 01:31 AM - edited 08-03-2020 01:45 AM
Actually, on a clean CentOS 7.6 a simple pip install numpy does not work: the command fails with "RuntimeError: Python version >= 3.6 required". I had to upgrade pip first, change the default permission mask (when installing system-wide as root; otherwise the installed numpy package is not readable by non-root users), and only then install numpy:
# pip install --upgrade pip
Collecting pip
[...]
# umask 022; pip install numpy
Nonetheless, this workaround is not scalable (it should be managed cluster-wide from Cloudera Manager, not from the command line on each node) and it goes against Python/pip best practices (pip should not be used for system-wide, root-level package installations). Hence I am still looking for a solution for making the PySpark script use the Anaconda Python on the cluster nodes.
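In the meantime, the closest thing I have to a per-application workaround is pointing Spark at the parcel's interpreter from the notebook itself (a minimal sketch, assuming the default Anaconda parcel path and that the SparkContext has not been created yet; this is not the cluster-wide Cloudera Manager solution I am after):
import os
# Adjust to the actual Anaconda parcel location on your cluster.
anaconda_python = "/opt/cloudera/parcels/Anaconda/bin/python"
# Both variables must be set before the SparkContext starts.
os.environ["PYSPARK_PYTHON"] = anaconda_python
os.environ["PYSPARK_DRIVER_PYTHON"] = anaconda_python
from pyspark import SparkConf, SparkContext
# On YARN, also propagate the setting to the application master.
conf = (SparkConf()
        .setAppName("numpy-check")
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", anaconda_python))
sc = SparkContext(conf=conf)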
