Created on 11-06-2017 12:12 PM - edited 09-16-2022 05:29 AM
Hi
I've just installed Data Science Workbench 1.2 on a single Master Node (under VMware 6.5). From my understanding of the documentation, adding Worker Nodes is optional. The service comes up under the cluster okay, and in Cloudera Manager (5.13) it has Green Health. However, when I run the command cdsw status on the master node CLI, it reports 'Cloudera Data Science Workbench is not ready yet' and 'Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]'.
I can open a project successfully and some example pyspark files work fine. But any pyspark script that uses numpy gives the error:
File "/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport __import__(name) ImportError: ('No module named numpy', <function subimport at 0x1e75cf8>, ('numpy',))
When I issue the command pip list in the session terminal, it lists numpy (1.12.1) as being installed.
Any advice on fixing this would be much appreciated.
Thanks a lot.
Rob Sullivan (London)
Created 11-06-2017 12:20 PM
Hi Rob,
Could you give a minimal example of a script that fails with this error? I'm specifically interested in where the import of numpy is being done.
Thanks,
Tristan
Created 11-06-2017 01:29 PM
Thanks for the reply, Tristan.
The kmeans.py script provided in the CDSW is the one I've tried.
Rob.
# # K-Means
#
# The K-means algorithm written from scratch against PySpark. In practice,
# one may prefer to use the KMeans algorithm in ML, as shown in
# [this example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/kmeans_example.py).
#
# This example requires [NumPy](http://www.numpy.org/).

from __future__ import print_function

import sys

import numpy as np
from pyspark.sql import SparkSession


def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])


def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex


spark = SparkSession\
    .builder\
    .appName("PythonKMeans")\
    .getOrCreate()

# Add the data file to hdfs.
!hdfs dfs -put resources/data/mllib/kmeans_data.txt /tmp

lines = spark.read.text("/tmp/kmeans_data.txt").rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
K = 2
convergeDist = 0.1

kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1])).collect()

    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    for (iK, p) in newPoints:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))

spark.stop()
Created 11-07-2017 12:18 AM
Hello Rob,
You are trying to use the numpy library from inside a map function iterating over RDDs. The transformation you specify runs on the Executors, which are hosted on the different machines where YARN NodeManagers run. To make this work, you need to make sure the numpy library is installed on all of the NodeManager machines.
Regards,
Peter
Created 11-07-2017 03:30 AM
Huge thanks, Peter; the example scripts calling numpy now work fine.
I guess I'm a little confused though - the CDSW documentation talks about installing Python packages inside the Docker session so that they're isolated. But if you still need all these packages installed on the YARN machines, doesn't this defeat the purpose?
Rob.
Created 11-07-2017 04:49 AM
Hi Rob,
This is a good question. Out of the box, CDSW gives you isolation by storing project-specific dependencies in the project folders. If you also need isolated dependencies on the cluster side (the Spark Executor side), you need to follow the steps described in this blog post:
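As a configuration sketch of the general pattern (the archive name and paths here are illustrative, and this is not a substitute for the blog post's full steps): a common approach is to pack a virtualenv or conda environment into an archive, ship it with the job via YARN, and point the executors' Python at it.

```python
# Hypothetical sketch: ship a pre-built environment archive (env.tar.gz,
# containing numpy etc.) to the executors and use its interpreter there.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("IsolatedDeps")
         # Distribute the archive to each executor's working directory as ./env
         .config("spark.yarn.dist.archives", "env.tar.gz#env")
         # Use the shipped interpreter on the cluster side
         .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./env/bin/python")
         .config("spark.executorEnv.PYSPARK_PYTHON", "./env/bin/python")
         .getOrCreate())
```

This keeps the cluster-side dependencies tied to the job rather than to the NodeManager hosts, which is the isolation Rob is asking about.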
The "Cloudera Data Science Workbench is not ready yet" output for the cdsw status message while everything looks good is a known issue which we currently working on.
Regards,
Peter
Created 11-07-2017 05:58 AM
Thanks Peter.
On the cdsw status point, that's good to know. I'm just an amateur, exploring Cloudera on my VMware homelab setup. It's been fun but a bit of a Herculean effort getting CDH/CDM/CDSW set up correctly, and very satisfying to know I now have a working setup I can experiment with. It's a pity there isn't a free license option for CDSW.
Kind Regards
Rob.
Created 04-04-2018 11:07 AM
Hi Rob,
What did you do to get this resolved?
Peter,
I am facing a similar issue when trying to import the requests module inside foreachRDD. I am running in local mode and have the library available on the host, but I am getting the error below.
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 710, in subimport
__import__(name)
ImportError: ('No module named requests', <function subimport at 0x7f2671268488>, ('requests',))
Created 11-07-2017 04:06 AM
Peter
One more question, if I may. Although CDSW shows all health tests as green, when I issue the cdsw status command on the master node I get the following output:
Thanks a lot
Rob (London)
-------------------------------
[root@dsw-master ~]# cdsw status
Sending detailed logs to [/tmp/cdsw_status_cHSgF9.log] ...
CDSW Version: [1.2.0:d573dd7]
OK: Application running as root check
Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]
OK: Sysctl params check
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | STATUS | CREATED-AT | VERSION | EXTERNAL-IP | OS-IMAGE | KERNEL-VERSION | GPU | STATEFUL |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| dsw-master.femto.lab | True | 2017-11-06 15:04:05+00:00 | v1.6.11 | None | CentOS Linux 7 (Core) | 3.10.0-514.el7.x86_64 | 0 | True |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1/1 nodes are ready.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | READY | STATUS | RESTARTS | CREATED-AT | POD-IP | HOST-IP | ROLE |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| etcd-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-apiserver-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-controller-manager-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-dns-3911048160-bsxrk | 3/3 | Running | 0 | 2017-11-06 15:04:18+00:00 | 100.66.0.2 | 192.168.1.94 | None |
| kube-proxy-7mz7x | 1/1 | Running | 0 | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-scheduler-dsw-master.femto.lab | 1/1 | Running | 0 | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| node-problem-detector-v0.1-ccnvs | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| weave-net-b11mc | 2/2 | Running | 0 | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster kube-system.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| NAME | READY | STATUS | RESTARTS | CREATED-AT | POD-IP | HOST-IP | ROLE |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| cron-962987953-pzq31 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.3 | 192.168.1.94 | cron |
| db-875553086-tvss8 | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.7 | 192.168.1.94 | db |
| db-migrate-d573dd7-hzrbq | 0/1 | Succeeded | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.6 | 192.168.1.94 | db-migrate |
| engine-deps-jk1sq | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.5 | 192.168.1.94 | engine-deps |
| ingress-controller-506514573-9fftg | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 192.168.1.94 | 192.168.1.94 | ingress-controller |
| livelog-1589742313-d3lzb | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.4 | 192.168.1.94 | livelog |
| reconciler-1584998901-txmr7 | 1/1 | Running | 0 | 2017-11-06 15:05:35+00:00 | 100.66.0.8 | 192.168.1.94 | reconciler |
| spark-port-forwarder-7k9rd | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | spark-port-forwarder |
| web-53233289-krp05 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.9 | 192.168.1.94 | web |
| web-53233289-ss5t3 | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.10 | 192.168.1.94 | web |
| web-53233289-twk6p | 1/1 | Running | 0 | 2017-11-06 15:05:36+00:00 | 100.66.0.11 | 192.168.1.94 | web |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
OK: HTTP port check
Cloudera Data Science Workbench is not ready yet
[root@dsw-master ~]#