I am installing Cloudera Data Science Workbench (CDSW) on 2 new machines in an existing cluster via Cloudera Manager.
All my services (docker daemons, master, worker, application) are green, but the status is bad, stating "Failed to connect to Kubernetes Master api.". It looks like Kubernetes is not running at all. When I run "kubectl config view", the output is nearly empty: of all the values, only the version is set. When I run "kubectl cluster-info", I get the expected error "The connection to the server localhost:8080 was refused" (nothing is listening, the API is not up). I searched the system for the Kubernetes admin.conf file and did not find it. I also tried "kubeadm init", but it cannot connect to the Internet.
The server is running behind a proxy. The proxy settings are in the environment variables, and I have also added them to the Workbench settings.
I am at a loss here and no longer know where to look or what to try. Any help would be appreciated. I have tried to include the important parts, and I can of course provide any further information about the installation.
Some of the common issues that we have seen for this sort of error are:
1. On machines behind a proxy, the NO_PROXY environment variable needs to be set as well (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_proxy.html)
2. localhost does not resolve to 127.0.0.1 or the local IP: 'nslookup localhost' should return the right address. Kubernetes does not honor the /etc/hosts entries
3. The 'memory' cgroups are disabled on the machine. This can be verified using 'ls -lrt /sys/fs/cgroup/memory/' on most machines.
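As a complement to the 'ls' check in point 3, here is a minimal Python sketch that reads /proc/cgroups and reports whether the 'memory' controller is marked as enabled. It assumes a Linux host where /proc/cgroups has its usual four columns (subsys_name, hierarchy, num_cgroups, enabled). The function name memory_cgroup_enabled is only illustrative and is not part of CDSW or Kubernetes.

```python
from pathlib import Path


def memory_cgroup_enabled(proc_cgroups_text: str) -> bool:
    """Return True if the 'memory' controller is listed as enabled.

    Expects the content of /proc/cgroups, whose columns are assumed to be:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        # Skip the header line (starts with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "memory":
            return fields[3] == "1"
    # The controller is not listed at all.
    return False


if __name__ == "__main__":
    path = Path("/proc/cgroups")
    if path.exists():
        print("memory cgroup enabled:", memory_cgroup_enabled(path.read_text()))
    else:
        print("/proc/cgroups not found (is this a Linux host?)")
```

If this prints False while the directory listing still shows entries, the kernel boot parameters would be the next place to look.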
Thank you for the swift reply!
1. I have already configured NO_PROXY with the value from the page you specified (changing <MASTER_IP> to the IP of my master host), but that didn't seem to help. I have also set HTTP_PROXY and HTTPS_PROXY. However, I didn't know about the certificate part. I don't personally have access to our proxy certificate, so I will ask an admin to give it to me and see if that helps.
2. Checked that before, looks fine.
3. I performed the check and can see many memory cgroups. I believe they are not disabled then.
So it looks like the proxy certificate is the only thing left for me to try to fix. I will post again as soon as I have done so.
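For anyone double-checking point 1 on their own hosts, the following is a rough Python sketch that lists which hosts are not covered by the current NO_PROXY value. It assumes a simplified matching rule (exact entries, a leading-dot domain suffix, or '*'); real tools interpret NO_PROXY differently, and some also accept CIDR ranges, which this sketch does not handle. The function name uncovered_hosts and the host list are hypothetical, and <MASTER_IP> is a placeholder to be replaced with the actual master IP.

```python
import os
from typing import List


def uncovered_hosts(no_proxy: str, hosts: List[str]) -> List[str]:
    """Return the hosts that no NO_PROXY entry matches.

    Simplified matching (an assumption, not the behaviour of any specific tool):
    an entry matches if it is '*', equals the host, or starts with '.' and
    the host ends with it.
    """
    entries = [e.strip().lower() for e in no_proxy.split(",") if e.strip()]
    missing = []
    for host in hosts:
        h = host.lower()
        covered = any(
            e == "*" or h == e or (e.startswith(".") and h.endswith(e))
            for e in entries
        )
        if not covered:
            missing.append(host)
    return missing


if __name__ == "__main__":
    # Both spellings are in use; check the upper-case one first.
    value = os.environ.get("NO_PROXY") or os.environ.get("no_proxy") or ""
    # Hypothetical list: replace <MASTER_IP> with the real master IP.
    to_check = ["127.0.0.1", "localhost", "<MASTER_IP>"]
    print("not covered by NO_PROXY:", uncovered_hosts(value, to_check))
```

An empty result only shows that the variable is set as intended in the current shell. It says nothing about whether the CDSW services themselves see the same value.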
Thanks for all the replies. We ended up disabling the proxy, and that helped, but it is only a short-term solution. It looks like we didn't have the needed SSL certificates for the hosts (I was surprised to find that out), so as soon as we get them, that should solve our problem.