Member since 11-06-2017 · Posts: 8 · Kudos Received: 1 · Solutions: 0
11-07-2017
05:58 AM
Thanks Peter. On the cdsw status point, that's good to know. I'm just an amateur exploring Cloudera on my VMware homelab setup. It's been fun but a bit of a herculean effort getting CDH/CDM/CDSW set up correctly, but very satisfying to know I now have a working setup I can experiment with. It's a pity there isn't a free license option for CDSW. Kind Regards, Rob.
11-07-2017
04:06 AM
Peter, one more question if I may. Although CDSW indicates all health tests as green, when I issue the cdsw status command on the master node I get the following output. Thanks a lot, Rob (London)

[root@dsw-master ~]# cdsw status
Sending detailed logs to [/tmp/cdsw_status_cHSgF9.log] ...
CDSW Version: [1.2.0:d573dd7]
OK: Application running as root check
Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]
OK: Sysctl params check

| NAME                 | STATUS | CREATED-AT                | VERSION | EXTERNAL-IP | OS-IMAGE              | KERNEL-VERSION        | GPU | STATEFUL |
| dsw-master.femto.lab | True   | 2017-11-06 15:04:05+00:00 | v1.6.11 | None        | CentOS Linux 7 (Core) | 3.10.0-514.el7.x86_64 | 0   | True     |
1/1 nodes are ready.

| NAME                                         | READY | STATUS  | RESTARTS | CREATED-AT                | POD-IP       | HOST-IP      | ROLE |
| etcd-dsw-master.femto.lab                    | 1/1   | Running | 0        | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-apiserver-dsw-master.femto.lab          | 1/1   | Running | 0        | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-controller-manager-dsw-master.femto.lab | 1/1   | Running | 0        | 2017-11-06 15:05:13+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-dns-3911048160-bsxrk                    | 3/3   | Running | 0        | 2017-11-06 15:04:18+00:00 | 100.66.0.2   | 192.168.1.94 | None |
| kube-proxy-7mz7x                             | 1/1   | Running | 0        | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| kube-scheduler-dsw-master.femto.lab          | 1/1   | Running | 0        | 2017-11-06 15:04:02+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| node-problem-detector-v0.1-ccnvs             | 1/1   | Running | 0        | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | None |
| weave-net-b11mc                              | 2/2   | Running | 0        | 2017-11-06 15:04:19+00:00 | 192.168.1.94 | 192.168.1.94 | None |
All required pods are ready in cluster kube-system.

| NAME                               | READY | STATUS    | RESTARTS | CREATED-AT                | POD-IP       | HOST-IP      | ROLE                 |
| cron-962987953-pzq31               | 1/1   | Running   | 0        | 2017-11-06 15:05:36+00:00 | 100.66.0.3   | 192.168.1.94 | cron                 |
| db-875553086-tvss8                 | 1/1   | Running   | 0        | 2017-11-06 15:05:35+00:00 | 100.66.0.7   | 192.168.1.94 | db                   |
| db-migrate-d573dd7-hzrbq           | 0/1   | Succeeded | 0        | 2017-11-06 15:05:35+00:00 | 100.66.0.6   | 192.168.1.94 | db-migrate           |
| engine-deps-jk1sq                  | 1/1   | Running   | 0        | 2017-11-06 15:05:35+00:00 | 100.66.0.5   | 192.168.1.94 | engine-deps          |
| ingress-controller-506514573-9fftg | 1/1   | Running   | 0        | 2017-11-06 15:05:35+00:00 | 192.168.1.94 | 192.168.1.94 | ingress-controller   |
| livelog-1589742313-d3lzb           | 1/1   | Running   | 0        | 2017-11-06 15:05:35+00:00 | 100.66.0.4   | 192.168.1.94 | livelog              |
| reconciler-1584998901-txmr7        | 1/1   | Running   | 0        | 2017-11-06 15:05:35+00:00 | 100.66.0.8   | 192.168.1.94 | reconciler           |
| spark-port-forwarder-7k9rd         | 1/1   | Running   | 0        | 2017-11-06 15:05:36+00:00 | 192.168.1.94 | 192.168.1.94 | spark-port-forwarder |
| web-53233289-krp05                 | 1/1   | Running   | 0        | 2017-11-06 15:05:36+00:00 | 100.66.0.9   | 192.168.1.94 | web                  |
| web-53233289-ss5t3                 | 1/1   | Running   | 0        | 2017-11-06 15:05:36+00:00 | 100.66.0.10  | 192.168.1.94 | web                  |
| web-53233289-twk6p                 | 1/1   | Running   | 0        | 2017-11-06 15:05:36+00:00 | 100.66.0.11  | 192.168.1.94 | web                  |
All required pods are ready in cluster default.
All required Application services are configured.
All required config maps are ready.
All required secrets are available.
Persistent volumes are ready.
Persistent volume claims are ready.
Ingresses are ready.
OK: HTTP port check
Cloudera Data Science Workbench is not ready yet
[root@dsw-master ~]#
11-07-2017
03:30 AM
Huge thanks Peter, the example scripts calling NumPy now work fine. I guess I'm a little confused though - the CDSW documentation talks about installing Python packages inside the Docker session so that they're isolated. But if you still need all these packages installed on the YARN machines, doesn't this defeat the purpose? Rob.
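One way to see why the YARN hosts still need the package (a minimal stand-alone sketch, using the stdlib json module as a stand-in for numpy, since this point is about pickling generally, not CDSW specifically): Spark ships driver-side functions to the executors as pickles, and a pickled module-level function is stored only as a module-plus-name reference, so the interpreter that unpickles it must be able to import that module itself.

```python
import json
import pickle

# A pickled module-level function is stored as a *reference* (module name +
# attribute name), not as code. Whichever interpreter unpickles it must
# re-import that module locally -- this is what the `subimport` frame in
# the cloudpickle traceback is doing when it fails on a YARN node.
payload = pickle.dumps(json.dumps)   # json stands in for numpy here

print(b"json" in payload, b"dumps" in payload)   # both names are embedded

# Unpickling re-imports json in *this* interpreter and resolves the name.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

So the session's Docker image isolates the driver-side environment, but any module the shipped functions reference must also be importable by the executors' own Python on each YARN host.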
11-06-2017
01:29 PM
Thanks for the reply, Tristan. The kmeans.py script provided in CDSW is the one I've tried. Rob.

# # K-Means
#
# The K-means algorithm written from scratch against PySpark. In practice,
# one may prefer to use the KMeans algorithm in ML, as shown in
# [this example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/kmeans_example.py).
#
# This example requires [NumPy](http://www.numpy.org/).

from __future__ import print_function

import sys

import numpy as np
from pyspark.sql import SparkSession


def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])


def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex


spark = SparkSession\
    .builder\
    .appName("PythonKMeans")\
    .getOrCreate()

# Add the data file to hdfs.
!hdfs dfs -put resources/data/mllib/kmeans_data.txt /tmp

lines = spark.read.text("/tmp/kmeans_data.txt").rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
K = 2
convergeDist = 0.1

kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1])).collect()

    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)

    for (iK, p) in newPoints:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))

spark.stop()
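For reference, the same iteration can be sanity-checked without a cluster. This is a plain-Python sketch of the loop above (no Spark or NumPy, so it also runs where those aren't installed), fed the six points from Spark's sample kmeans_data.txt and explicit initial centers so the result is deterministic:

```python
def closest_point(p, centers):
    # Index of the nearest center to p, by squared Euclidean distance.
    dists = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers]
    return dists.index(min(dists))

def mean_point(points):
    # Coordinate-wise mean of a list of points.
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def kmeans_local(data, init_centers, converge_dist):
    # Same structure as the PySpark loop: assign, re-average, repeat
    # until the total center movement drops below converge_dist.
    k_points = [list(c) for c in init_centers]
    temp_dist = float("inf")
    while temp_dist > converge_dist:
        assignments = [closest_point(p, k_points) for p in data]
        new_points = []
        for i in range(len(k_points)):
            members = [p for p, a in zip(data, assignments) if a == i]
            new_points.append(mean_point(members) if members else k_points[i])
        temp_dist = sum(sum((a - b) ** 2 for a, b in zip(k_points[i], new_points[i]))
                        for i in range(len(k_points)))
        k_points = new_points
    return k_points

# The six points from Spark's sample data/mllib/kmeans_data.txt:
data = [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
        [9.0, 9.0, 9.0], [9.1, 9.1, 9.1], [9.2, 9.2, 9.2]]
centers = kmeans_local(data, [data[0], data[3]], 0.1)
print(centers)   # two centers, near [0.1, 0.1, 0.1] and [9.1, 9.1, 9.1]
```

With those initial centers the loop converges in one pass to the two cluster means, matching what the PySpark version prints for the same data.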
11-06-2017
12:12 PM
1 Kudo
Hi
I've just installed Data Science Workbench 1.2 on a single master node (under VMware 6.5). From my understanding of the documentation, adding worker nodes is optional. The service comes up under the cluster okay, and in Cloudera Manager (5.13) it has green health. However, when I run the command cdsw status on the master node CLI it reports 'Cloudera Data Science Workbench is not ready yet' and 'Status check failed for services: [docker, kubelet, cdsw-app, cdsw-host-controller]'.
I can open a project successfully and some example PySpark files work fine. But any PySpark script that uses numpy gives the error:
File "/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
__import__(name)
ImportError: ('No module named numpy', <function subimport at 0x1e75cf8>, ('numpy',))
When I issue the command pip list in the session terminal, it lists numpy (1.12.1) as installed.
Any advice on fixing this would be much appreciated.
Thanks a lot.
Rob Sullivan (London)
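A note on the symptom for anyone hitting the same mismatch: pip list inside a CDSW session only describes the session's own container; the executors use whatever Python sits on each YARN host (typically whatever PYSPARK_PYTHON points at). A small hand-rolled checker like the following (a hypothetical helper, not a CDSW or Cloudera tool) can probe a given interpreter for a module:

```python
import subprocess
import sys

def module_available(python_exe, module_name):
    # Probe whether `module_name` imports cleanly under `python_exe`.
    # Point python_exe at the interpreter the YARN executors use
    # (e.g. the value of PYSPARK_PYTHON) to see what *they* can import.
    result = subprocess.run(
        [python_exe, "-c", "import " + module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Demonstrated against the current interpreter:
print(module_available(sys.executable, "json"))             # stdlib: True
print(module_available(sys.executable, "no_such_module_x"))  # missing: False
```

If this returns False for numpy under the executors' interpreter, the fix is installing numpy on the YARN NodeManager hosts, not in the session.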