New Contributor
Posts: 5
Registered: ‎11-06-2017
Accepted Solution

CDSW Error: No module named numpy???


Hi

 

I've just installed Data Science Workbench 1.2 on a single master node (under VMware 6.5). From my understanding of the documentation, adding worker nodes is optional. The service comes up under the cluster okay, and Cloudera Manager (5.13) shows it with green health. However, when I run the command cdsw status on the master node CLI, it reports 'Cloudera Data Science Workbench is not ready yet'. It says 'Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]'.

 

I can open a project successfully and some example pyspark files work fine. But any pyspark script that uses numpy gives the error:

 

File "/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
    __import__(name)
ImportError: ('No module named numpy', <function subimport at 0x1e75cf8>, ('numpy',))

 

When I issue the command pip list in the session terminal, it lists numpy (1.12.1) as installed.

 

Any advice on fixing this would be much appreciated.

 

Thanks a lot.

 

Rob Sullivan (London)

Cloudera Employee
Posts: 31
Registered: ‎04-28-2017

Re: CDSW Error: No module named numpy???

Hi Rob,

 

Could you give a minimal example of a script that fails with this error?  I'm specifically interested in where the import of numpy is being done.

 

Thanks,

Tristan

New Contributor
Posts: 5
Registered: ‎11-06-2017

Re: CDSW Error: No module named numpy???

Thanks for the reply, Tristan.

 

The kmeans.py example script provided with CDSW is the one I've tried.

 

Rob.

 

# # K-Means
#
# The K-means algorithm written from scratch against PySpark. In practice,
# one may prefer to use the KMeans algorithm in ML, as shown in
# [this example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/kmeans_example.py).
# 
# This example requires [NumPy](http://www.numpy.org/).

from __future__ import print_function
import sys
import numpy as np
from pyspark.sql import SparkSession

def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])

def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

spark = SparkSession\
    .builder\
    .appName("PythonKMeans")\
    .getOrCreate()

# Add the data file to hdfs.
!hdfs dfs -put resources/data/mllib/kmeans_data.txt /tmp

lines = spark.read.text("/tmp/kmeans_data.txt").rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
K = 2
convergeDist = 0.1

kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1])).collect()

    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    for (iK, p) in newPoints:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))

spark.stop()

 

Cloudera Employee
Posts: 27
Registered: ‎07-09-2015

Re: CDSW Error: No module named numpy???

Hello Rob,

 

You are using the numpy library inside map functions that iterate over RDDs. The transformations you specify run on the Spark executors, which are hosted on the machines where the YARN NodeManager is running, not inside the CDSW session. That is why pip list in the session shows numpy while the job still fails. To make this work, you need to make sure that the numpy library is installed on all of the NodeManager machines.
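
If it helps, here is a minimal sketch for checking which hosts can actually import a module. The helper names probe_module and summarize are made up for illustration, and the commented-out Spark part assumes you already have a SparkSession called spark, as in kmeans.py. Treat it as a quick diagnostic rather than an official tool.

```python
import importlib
import socket


def probe_module(module_name):
    """Return (hostname, importable) for the machine this function runs on."""
    try:
        importlib.import_module(module_name)
        return (socket.gethostname(), True)
    except ImportError:
        return (socket.gethostname(), False)


def summarize(results):
    """Collapse (hostname, importable) pairs into {hostname: importable}.

    A host is reported as False if any probe on that host failed.
    """
    summary = {}
    for host, ok in results:
        summary[host] = summary.get(host, True) and ok
    return summary


# Local check on the driver, i.e. inside the CDSW session.
print(summarize([probe_module("numpy")]))

# On the cluster (assumption: an existing SparkSession named `spark`),
# the same probe could be run on the executors like this:
#
#   results = (spark.sparkContext
#              .parallelize(range(100), 20)
#              .map(lambda _: probe_module("numpy"))
#              .distinct()
#              .collect())
#   print(summarize(results))
```

Any host that comes back as False in the executor run is a NodeManager machine where the Python used by Spark cannot import the module. Note that the tasks only land on hosts where executors happen to be scheduled, so a small job may not reach every NodeManager.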

 

Regards,

Peter

New Contributor
Posts: 5
Registered: ‎11-06-2017

Re: CDSW Error: No module named numpy???

Huge thanks, Peter. The example scripts calling NumPy now work fine.

 

I guess I'm a little confused though. The CDSW documentation talks about installing Python packages inside the Docker session so that they're isolated. But if you still need all these packages installed on the YARN machines, doesn't this defeat the object?

 

Rob.

New Contributor
Posts: 5
Registered: ‎11-06-2017

Re: CDSW Error: No module named numpy???

Peter

 

One more question if I may. Although CDSW indicates all health tests as green, when I issue the cdsw status command on the master node I get the following output:

 

Thanks a lot

 

Rob (London)

 

-------------------------------

 

[root@dsw-master ~]# cdsw status
Sending detailed logs to [/tmp/cdsw_status_cHSgF9.log] ...
CDSW Version: [1.2.0:d573dd7]
OK: Application running as root check
Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]
OK: Sysctl params check
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | STATUS | CREATED-AT | VERSION | EXTERNAL-IP | OS-IMAGE | KERNEL-VERSION | GPU | STATEFUL |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| dsw-master.femto.lab | True | 2017-11-06 15:04:05+00:00 | v1.6.11 | None | CentOS Linux 7 (Core) | 3.10.0-514.el7.x86_64 | 0 | True |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1/1 nodes are ready.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | READY | STATUS | RESTARTS | CREATED-AT | POD-IP | HOST-IP | ROLE |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| etcd-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-apiserver-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-controller-manager-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-dns-3911048160-bsxrk | 3/3 | Running | 0 | 2017-11-06 15:04:18+00:00 | 100.66.0.2 | 192.168.1.94 | None |
| kube-proxy-7mz7x | 1/1 | Running | 0 | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-scheduler-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| node-problem-detector-v0.1-ccnvs | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| weave-net-b11mc | 2/2 | Running | 0 | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster kube-system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | READY | STATUS | RESTARTS | CREATED-AT | POD-IP | HOST-IP | ROLE |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| cron-962987953-pzq31 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.3 | 192.168.1.94 | cron |
| db-875553086-tvss8 | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.7 | 192.168.1.94 | db |
| db-migrate-d573dd7-hzrbq | 0/1 | Succeeded | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.6 | 192.168.1.94 | db-migrate |
| engine-deps-jk1sq | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.5 | 192.168.1.94 | engine-deps |
| ingress-controller-506514573-9fftg | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 192.168.1.94 | 192.168.1.94 | ingress-controller |
| livelog-1589742313-d3lzb | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.4 | 192.168.1.94 | livelog |
| reconciler-1584998901-txmr7 | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.8 | 192.168.1.94 | reconciler |
| spark-port-forwarder-7k9rd | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | spark-port-forwarder |
| web-53233289-krp05 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.9 | 192.168.1.94 | web |
| web-53233289-ss5t3 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.10 | 192.168.1.94 | web |
| web-53233289-twk6p | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.11 | 192.168.1.94 | web |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
OK: HTTP port check
Cloudera Data Science Workbench is not ready yet
[root@dsw-master ~]#

Cloudera Employee
Posts: 27
Registered: ‎07-09-2015

Re: CDSW Error: No module named numpy???

Hi Rob,

 

This is a good question. Out of the box, CDSW gives you isolation by storing project-specific dependencies in the project folders. If you also need isolated dependencies on the cluster side (the Spark executor side), you need to follow the steps described in this blog post:

https://blog.cloudera.com/blog/2017/04/use-your-favorite-python-library-on-pyspark-cluster-with-clou...
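
As a rough sketch of the general idea only (check the blog post for the exact steps): a packed Python environment is shipped to YARN as an archive, and Spark is pointed at the interpreter inside it. The two property names below are standard Spark-on-YARN settings. The helper names, the archive name numpy_env.zip, the alias ENV and the interpreter path are all hypothetical and only there for illustration.

```python
def packed_env_conf(archive, alias, python_in_archive):
    """Build Spark properties for shipping a packed Python environment.

    archive           -- zipped environment to distribute (hypothetical name)
    alias             -- directory name the archive is unpacked under in each
                         YARN container's working directory
    python_in_archive -- path of the interpreter inside the unpacked archive
    """
    interpreter = "./{0}/{1}".format(alias, python_in_archive)
    return [
        # "path#alias" asks YARN to unpack the archive under `alias`.
        ("spark.yarn.dist.archives", "{0}#{1}".format(archive, alias)),
        # Point PySpark at the interpreter shipped inside the archive.
        ("spark.yarn.appMasterEnv.PYSPARK_PYTHON", interpreter),
    ]


def render(conf):
    """Render (key, value) pairs as key=value lines for a properties file."""
    return "\n".join("{0}={1}".format(key, value) for key, value in conf)


# Hypothetical values; substitute the archive and paths you actually built.
print(render(packed_env_conf("numpy_env.zip", "ENV", "numpy_env/bin/python")))
```

The printed lines are in the key=value form that a spark-defaults.conf file accepts. Which properties your Spark version honours in client mode is worth checking against the Spark documentation before relying on this.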

 

The "Cloudera Data Science Workbench is not ready yet" message from cdsw status while everything else looks good is a known issue which we are currently working on.

 

Regards,

Peter

New Contributor
Posts: 5
Registered: ‎11-06-2017

Re: CDSW Error: No module named numpy???

Thanks Peter.

 

On the cdsw status point, that's good to know. I'm just an amateur, exploring Cloudera on my VMware homelab setup. It's been fun, if a bit of a herculean effort, getting CDH/CDM/CDSW set up correctly, and it's very satisfying to know I now have a working setup I can experiment with. It's a pity there isn't a free license option for CDSW.

 

Kind Regards

 

Rob.
