
CDSW Error: No module named numpy???

Explorer

Hi

 

I've just installed Data Science Workbench 1.2 on a single Master Node (under VMware 6.5); from my understanding of the documentation, adding Worker Nodes is optional. The service comes up under the cluster okay and shows green health in Cloudera Manager (5.13). However, when I run the command cdsw status on the master node CLI it reports 'Cloudera Data Science Workbench is not ready yet' and 'Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]'.

 

I can open a project successfully and some of the example PySpark scripts work fine, but any PySpark script that uses numpy gives this error:

 

File "/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
    __import__(name)
ImportError: ('No module named numpy', <function subimport at 0x1e75cf8>, ('numpy',))

 

When I issue the command pip list in the session terminal, it lists numpy (1.12.1) as installed.

 

Any advice on fixing this would be much appreciated.

 

Thanks a lot.

 

Rob Sullivan (London)


9 REPLIES

Expert Contributor

Hi Rob,

 

Could you give a minimal example of a script that fails with this error?  I'm specifically interested in where the import of numpy is being done.

 

Thanks,

Tristan

Explorer

Thanks for the reply, Tristan.

 

The kmeans.py script provided in CDSW is the one I've tried.

 

Rob.

 

# # K-Means
#
# The K-means algorithm written from scratch against PySpark. In practice,
# one may prefer to use the KMeans algorithm in ML, as shown in
# [this example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/kmeans_example.py).
# 
# This example requires [NumPy](http://www.numpy.org/).

from __future__ import print_function
import sys
import numpy as np
from pyspark.sql import SparkSession

def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])

def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

spark = SparkSession\
    .builder\
    .appName("PythonKMeans")\
    .getOrCreate()

# Add the data file to hdfs.
!hdfs dfs -put resources/data/mllib/kmeans_data.txt /tmp

lines = spark.read.text("/tmp/kmeans_data.txt").rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
K = 2
convergeDist = 0.1

kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1])).collect()

    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    for (iK, p) in newPoints:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))

spark.stop()

 

Super Collaborator

Hello Rob,

 

You are trying to use the numpy library from inside a map function iterating over RDDs. The transformation you specify runs on the Spark executors, which are hosted on the machines where the YARN NodeManagers run. To make this work you need to make sure that the numpy library is installed on all of the NodeManager machines.
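
If you want to confirm this from a CDSW session before installing anything, a quick probe along these lines (just a sketch; it assumes a running SparkSession called spark, like the one in your kmeans example) asks each executor whether it can import numpy:

def check_numpy(_):
    # This runs on the executors, not inside the CDSW session container.
    try:
        import numpy
        return "numpy " + numpy.__version__
    except ImportError:
        return "numpy missing"

# One distinct result per executor is enough to spot a missing library.
print(spark.sparkContext.parallelize(range(8), 8).map(check_numpy).distinct().collect())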

 

Regards,

Peter

Explorer

Huge thanks, Peter. The example scripts calling numpy now work fine.

 

I guess I'm a little confused though: the CDSW documentation talks about installing Python packages inside the Docker session so that they're isolated. But if you still need all these packages installed on the YARN machines, doesn't this defeat the object?

 

Rob.

Super Collaborator

Hi Rob,

 

This is a good question. Out of the box, CDSW gives you isolation by storing project-specific dependencies in the project folders. If you also need isolated dependencies on the cluster side (the Spark executor side), you need to follow the steps described in this blog post:

https://blog.cloudera.com/blog/2017/04/use-your-favorite-python-library-on-pyspark-cluster-with-clou...
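
Roughly, the approach in that post is to package a Python environment containing your libraries, ship it to the executors as an archive, and point the PySpark workers at the interpreter inside it. Below is a minimal sketch of the idea only; the environment name, archive path and alias are made up for illustration, and the exact, tested steps are in the blog post.

import os
from pyspark.sql import SparkSession

# Assumptions: no SparkSession is running yet in this session, and a packed
# environment (e.g. a zipped conda env containing numpy) has already been
# uploaded to HDFS as /tmp/deps-env.zip and unzips to a top-level deps-env/.
os.environ["PYSPARK_PYTHON"] = "./DEPS/deps-env/bin/python"  # worker interpreter inside the shipped archive

spark = SparkSession.builder \
    .appName("shipped-python-deps") \
    .config("spark.yarn.dist.archives", "hdfs:///tmp/deps-env.zip#DEPS") \
    .getOrCreate()

# Each executor unpacks the archive into its working directory under the DEPS alias,
# so this import resolves against the shipped environment rather than the host Python.
print(spark.sparkContext.parallelize(range(4), 4)
      .map(lambda _: __import__("numpy").__version__).distinct().collect())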

 

The "Cloudera Data Science Workbench is not ready yet" output for the cdsw status message while everything looks good is a known issue which we currently working on.

 

Regards,

Peter

Explorer

Thanks Peter.

 

On the cdsw status point, that's good to know. I'm just an amateur exploring Cloudera on my VMware homelab setup. Getting CDH/CDM/CDSW set up correctly has been fun but a bit of a herculean effort, and it's very satisfying to know I now have a working setup I can experiment with. It's a pity there isn't a free license option for CDSW.

 

Kind Regards

 

Rob.

New Contributor

Hi Rob,

What did you do to get this resolved?

 

Peter,

I am facing a similar issue when trying to import the requests module inside foreachRDD. I am running in local mode and have the library available on the host, but I get the error below.

 

  File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 710, in subimport
    __import__(name)
ImportError: ('No module named requests', <function subimport at 0x7f2671268488>, ('requests',))

Expert Contributor
It sounds like requests is not installed on your executors. You could manually install these libraries on all executors, or ship them using Spark following the techniques outlined in this blog post:

https://blog.cloudera.com/blog/2017/04/use-your-favorite-python-library-on-pyspark-cluster-with-clou...
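
One quick way to see what the workers are actually running (just a sketch; sc is assumed to be your existing SparkContext) is to ask them which interpreter they use and whether requests imports:

def check_requests(_):
    # Executed on the worker side, so this reflects the worker Python, not the driver.
    import sys
    try:
        import requests
        return (sys.executable, "requests " + requests.__version__)
    except ImportError:
        return (sys.executable, "requests missing")

print(sc.parallelize(range(4), 4).map(check_requests).distinct().collect())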

Tristan

Explorer

Peter

 

One more question if I may. Although CDSW shows all health tests as green, when I issue the cdsw status command on the master node I get the following output:

 

Thanks a lot

 

Rob (London)

 

-------------------------------

 

[root@dsw-master ~]# cdsw status
Sending detailed logs to [/tmp/cdsw_status_cHSgF9.log] ...
CDSW Version: [1.2.0:d573dd7]
OK: Application running as root check
Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]
OK: Sysctl params check
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | STATUS | CREATED-AT | VERSION | EXTERNAL-IP | OS-IMAGE | KERNEL-VERSION | GPU | STATEFUL |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| dsw-master.femto.lab | True | 2017-11-06 15:04:05+00:00 | v1.6.11 | None | CentOS Linux 7 (Core) | 3.10.0-514.el7.x86_64 | 0 | True |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1/1 nodes are ready.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | READY | STATUS | RESTARTS | CREATED-AT | POD-IP | HOST-IP | ROLE |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| etcd-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-apiserver-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-controller-manager-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-dns-3911048160-bsxrk | 3/3 | Running | 0 | 2017-11-06 15:04:18+00:00 | 100.66.0.2 | 192.168.1.94 | None |
| kube-proxy-7mz7x | 1/1 | Running | 0 | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-scheduler-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| node-problem-detector-v0.1-ccnvs | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| weave-net-b11mc | 2/2 | Running | 0 | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster kube-system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | READY | STATUS | RESTARTS | CREATED-AT | POD-IP | HOST-IP | ROLE |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| cron-962987953-pzq31 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.3 | 192.168.1.94 | cron |
| db-875553086-tvss8 | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.7 | 192.168.1.94 | db |
| db-migrate-d573dd7-hzrbq | 0/1 | Succeeded | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.6 | 192.168.1.94 | db-migrate |
| engine-deps-jk1sq | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.5 | 192.168.1.94 | engine-deps |
| ingress-controller-506514573-9fftg | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 192.168.1.94 | 192.168.1.94 | ingress-controller |
| livelog-1589742313-d3lzb | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.4 | 192.168.1.94 | livelog |
| reconciler-1584998901-txmr7 | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.8 | 192.168.1.94 | reconciler |
| spark-port-forwarder-7k9rd | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | spark-port-forwarder |
| web-53233289-krp05 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.9 | 192.168.1.94 | web |
| web-53233289-ss5t3 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.10 | 192.168.1.94 | web |
| web-53233289-twk6p | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.11 | 192.168.1.94 | web |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
OK: HTTP port check
Cloudera Data Science Workbench is not ready yet
[root@dsw-master ~]#