Created 09-02-2020 11:20 AM
@Marek I think it's definitely a network issue now.
Node IP: "Public-IP-Address" not found in the host's network interfaces
This message would indicate to me that the IP address of the host machine has changed, or at least that the above IP is not present at the network-interface level on this host.
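As a quick sanity check (just a sketch; run it on the affected host), you can list what is actually configured at the interface level and confirm whether the reported node IP is there:
# Sketch only: list the IPv4 addresses configured on this host's interfaces
ip -4 addr show
# the node IP from the error message should appear in this output; if it does not,
# kubelet cannot register the node under that address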
The issue is discussed in this thread: https://github.com/kubernetes/kubernetes/issues/54337
The architecture you are using is not supported. You might be able to hack around it using the workaround discussed in the thread:
Using the --hostname-override=external-ip argument for kubelet
but that is not a long-term solution. What I personally recommend is revising the network architecture, as CDSW is a little sensitive about this.
Created on 09-02-2020 11:54 PM - edited 09-03-2020 04:17 AM
@GangWar I have changed the kubelet parameter in /opt/cloudera/parcels/CDSW/scripts/start-kubelet-master-standalone-core.sh as suggested:
#kubelet_opts+=(--hostname-override=${master_hostname_lower})
kubelet_opts+=(--hostname-override=external-ip)
Unfortunately the pods (kube-apiserver, kube-scheduler, etcd) keep crashing/exiting.
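As a rough sketch of how one can inspect the crashing components at this point (the name filters below are assumptions and may differ per CDSW release):
# Illustrative only: check whether the core containers are still up under Docker
docker ps -a | grep -E 'kube-apiserver|kube-scheduler|etcd'
# then look at why a given container exited (replace <container-id> with one from the list above)
docker logs --tail 50 <container-id>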
Created 09-03-2020 12:33 AM
I do not see any successful host registrations. Please see the tail of the process logs below.
[root@cdsw-master-01 ~]# tail -n 10 /var/run/cloudera-scm-agent/process/19{09..11}*/logs/stderr.log
==> /var/run/cloudera-scm-agent/process/1909-cdsw-CDSW_DOCKER/logs/stderr.log <==
time="2020-09-03T07:00:55.018288801Z" level=error msg="Handler for GET /containers/12437b8b7b3b452bc7bfe8a3a26fe253de38601b7dd5093bd3d67a8f52b50e6b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2020-09-03 07:00:55.018357 I | http: multiple response.WriteHeader calls
time="2020-09-03T07:01:12.350659606Z" level=info msg="stopping containerd after receiving terminated"
time="2020-09-03T07:01:12.351645251Z" level=info msg="Processing signal 'terminated'"
time="2020-09-03T07:01:12.352045287Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
time="2020-09-03T07:01:13.187239486Z" level=info msg="libcontainerd: new containerd process, pid: 9176"
time="2020-09-03T07:01:13.206461276Z" level=error msg="containerd: notify OOM events" error="open /proc/8671/cgroup: no such file or directory"
time="2020-09-03T07:01:13.206730882Z" level=error msg="containerd: notify OOM events" error="open /proc/8808/cgroup: no such file or directory"
time="2020-09-03T07:01:13.206985589Z" level=error msg="containerd: notify OOM events" error="open /proc/8995/cgroup: no such file or directory"
time="2020-09-03T07:01:13.904988075Z" level=info msg="stopping containerd after receiving terminated"
==> /var/run/cloudera-scm-agent/process/1910-cdsw-CDSW_MASTER/logs/stderr.log <==
E0903 07:00:54.262100 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.362293 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.462458 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.480206 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.133.210.200:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dexternal-ip&limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.480889 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://10.133.210.200:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.481951 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://10.133.210.200:6443/api/v1/nodes?fieldSelector=metadata.name%3Dexternal-ip&limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.562631 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.662826 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.763006 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.863203 31064 kubelet.go:2266] node "external-ip" not found
==> /var/run/cloudera-scm-agent/process/1911-cdsw-CDSW_APPLICATION/logs/stderr.log <==
func(*targs, **kargs)
File "/opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/cdsw_admin/cdsw/admin.py", line 63, in stop
os.killpg(os.getpid(), signal.SIGKILL)
OSError: [Errno 3] No such process
+ is_kubelet_process_up
+ is_kube_cluster_configured
+ '[' -e /etc/kubernetes/admin.conf ']'
+ return 0
++ KUBECONFIG=/etc/kubernetes/kubelet.conf
++ /opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/kubernetes/bin/kubectl get nodes
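The same registration check the start script performs can also be run by hand, using the kubeconfig and kubectl binary from the trace above:
export KUBECONFIG=/etc/kubernetes/kubelet.conf
/opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/kubernetes/bin/kubectl get nodes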
Created on 09-03-2020 06:22 AM - edited 09-03-2020 06:23 AM
@GangWar The problem with crashing/exiting pods is now fixed. After the CDSW master host restoration, I had by mistake provisioned its MASTER_IP in the CM config as the address resolved by DNS from the CDSW FQDN; however, it should be the host's private IP address within the Cloudera cluster. Hence the intermediate problem is solved.
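In case it helps anyone else, a rough way to spot that mismatch (the FQDN below is a placeholder; substitute your own CDSW domain):
# address that DNS returns for the CDSW FQDN (placeholder name)
getent hosts cdsw.example.com
# addresses actually configured on the master host; MASTER_IP in CM should be
# the private address from this list, not the DNS-resolved public one
hostname -I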
Let me then kindly ask for further assistance in troubleshooting the original issue with HDFS access from CDSW sessions.