Created 06-24-2020 07:35 AM
Hi,
I would appreciate any advice on how to solve the following problem: in a CDH 6.3.2 HA-enabled cluster, I am unable to access HDFS from a CDSW CLI session:
!hdfs dfs -ls /
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:37","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode43. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:37","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode57. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:38","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 1 failover attempts. Trying to failover after sleeping for 813ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:38","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 2 failover attempts. Trying to failover after sleeping for 1903ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:40","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 3 failover attempts. Trying to failover after sleeping for 2225ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:43","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 4 failover attempts. Trying to failover after sleeping for 9688ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:52","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 5 failover attempts. Trying to failover after sleeping for 9501ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:02","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 6 failover attempts. Trying to failover after sleeping for 9001ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:11","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 7 failover attempts. Trying to failover after sleeping for 13904ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:25","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 8 failover attempts. Trying to failover after sleeping for 14567ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:39","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 9 failover attempts. Trying to failover after sleeping for 15279ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:55","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 10 failover attempts. Trying to failover after sleeping for 10985ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:05","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 11 failover attempts. Trying to failover after sleeping for 8394ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:14","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 12 failover attempts. Trying to failover after sleeping for 21701ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:36","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 13 failover attempts. Trying to failover after sleeping for 16983ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:53","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 14 failover attempts. Trying to failover after sleeping for 8437ms."}}
ls: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost
The contents of the /etc/hosts file on the CDH and CDSW nodes are:
# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.10.112 cdh-control-02.novalocal
10.10.10.111 cdh-control-01.novalocal
10.10.10.131 cdh-worker-01.novalocal
10.10.10.132 cdh-worker-02.novalocal
10.10.10.122 cdh-edge-02.novalocal
10.10.10.113 cdh-control-03.novalocal
10.10.10.121 cdh-edge-01.novalocal
10.10.10.133 cdh-worker-03.novalocal
10.10.10.110 cdsw-master-01.novalocal
10.10.10.130 cdsw-worker-01.novalocal
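The UnknownHostException is raised inside the session container, so it may help to test name resolution from the session itself rather than from the host shell. Below is a minimal Python sketch, assuming the two NameNode host names from the log output above. The helper name `check_resolution` and its `resolver` parameter are hypothetical, introduced only for illustration.

```python
import socket

# NameNode hosts taken from the log output above.
NAMENODE_HOSTS = ["cdh-control-02.novalocal", "cdh-control-03.novalocal"]


def check_resolution(hosts, resolver=socket.gethostbyname):
    """Map each host name to its resolved IPv4 address, or None on failure."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results
```

Running `print(check_resolution(NAMENODE_HOSTS))` in a CDSW session would show whether the container can resolve the names. A value of `None` suggests that the /etc/hosts entries on the node are not visible inside the session.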
Created 09-02-2020 11:20 AM
@Marek I think it's definitely a network issue now.
Node IP: "Public-IP-Address" not found in the host's network interfaces
This message indicates to me that the IP address of the host machine has changed, or at least that the IP address above is not present on any network interface of this host.
This thread discusses the issue: https://github.com/kubernetes/kubernetes/issues/54337
The architecture you are using is not supported. You might be able to work around it with the approach discussed in that thread:
Using the --hostname-override=external-ip argument for kubelet
but that is not a long-term solution. I personally recommend revising the network architecture, as CDSW is quite sensitive to this.
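One rough way to confirm whether an address is actually assigned to a local interface is to try binding a socket to it. This is only a sketch using the Python standard library: the function name `is_local_address` is hypothetical, it handles IPv4 only, and it can report a false positive on hosts where the `net.ipv4.ip_nonlocal_bind` sysctl is enabled.

```python
import socket


def is_local_address(ip):
    """Return True if an IPv4 UDP socket can be bound to `ip` on this host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((ip, 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        # Typically EADDRNOTAVAIL: the address is not on any local interface.
        return False
    finally:
        sock.close()
```

Calling `is_local_address(...)` on the CDSW master with the node IP from the kubelet message should return `False` if that address is not configured on the host.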
Created on 09-02-2020 11:54 PM - edited 09-03-2020 04:17 AM
@GangWar I have changed the kubelet parameter in /opt/cloudera/parcels/CDSW/scripts/start-kubelet-master-standalone-core.sh as suggested:
#kubelet_opts+=(--hostname-override=${master_hostname_lower})
kubelet_opts+=(--hostname-override=external-ip)
Unfortunately the pods (kube-apiserver, kube-scheduler, etcd) keep crashing/exiting.
Created 09-03-2020 12:23 AM
Created 09-03-2020 12:33 AM
I do not see any successful host registrations. Please see below the tail of the process logs.
[root@cdsw-master-01 ~]# tail 10 /var/run/cloudera-scm-agent/process/19{09..11}*/logs/stderr.log
==> /var/run/cloudera-scm-agent/process/1909-cdsw-CDSW_DOCKER/logs/stderr.log <==
time="2020-09-03T07:00:55.018288801Z" level=error msg="Handler for GET /containers/12437b8b7b3b452bc7bfe8a3a26fe253de38601b7dd5093bd3d67a8f52b50e6b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2020-09-03 07:00:55.018357 I | http: multiple response.WriteHeader calls
time="2020-09-03T07:01:12.350659606Z" level=info msg="stopping containerd after receiving terminated"
time="2020-09-03T07:01:12.351645251Z" level=info msg="Processing signal 'terminated'"
time="2020-09-03T07:01:12.352045287Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
time="2020-09-03T07:01:13.187239486Z" level=info msg="libcontainerd: new containerd process, pid: 9176"
time="2020-09-03T07:01:13.206461276Z" level=error msg="containerd: notify OOM events" error="open /proc/8671/cgroup: no such file or directory"
time="2020-09-03T07:01:13.206730882Z" level=error msg="containerd: notify OOM events" error="open /proc/8808/cgroup: no such file or directory"
time="2020-09-03T07:01:13.206985589Z" level=error msg="containerd: notify OOM events" error="open /proc/8995/cgroup: no such file or directory"
time="2020-09-03T07:01:13.904988075Z" level=info msg="stopping containerd after receiving terminated"
==> /var/run/cloudera-scm-agent/process/1910-cdsw-CDSW_MASTER/logs/stderr.log <==
E0903 07:00:54.262100 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.362293 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.462458 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.480206 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.133.210.200:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dexternal-ip&limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.480889 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://10.133.210.200:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.481951 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://10.133.210.200:6443/api/v1/nodes?fieldSelector=metadata.name%3Dexternal-ip&limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.562631 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.662826 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.763006 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.863203 31064 kubelet.go:2266] node "external-ip" not found
==> /var/run/cloudera-scm-agent/process/1911-cdsw-CDSW_APPLICATION/logs/stderr.log <==
func(*targs, **kargs)
File "/opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/cdsw_admin/cdsw/admin.py", line 63, in stop
os.killpg(os.getpid(), signal.SIGKILL)
OSError: [Errno 3] No such process
+ is_kubelet_process_up
+ is_kube_cluster_configured
+ '[' -e /etc/kubernetes/admin.conf ']'
+ return 0
++ KUBECONFIG=/etc/kubernetes/kubelet.conf
++ /opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/kubernetes/bin/kubectl get nodes
Created on 09-03-2020 06:22 AM - edited 09-03-2020 06:23 AM
@GangWar The problem with the crashing/exiting pods is now fixed. After restoring the CDSW master host, I mistakenly set its MASTER_IP in the CM config to the address that DNS resolves for the CDSW FQDN; it should instead be the host's private IP address within the Cloudera cluster. Hence the intermediate problem is solved.
Let me then kindly ask for further assistance in troubleshooting the original issue with the HDFS access from CDSW sessions.
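For reference, the MASTER_IP mix-up described above can be spotted by comparing what DNS returns for the CDSW FQDN with the host's private IP. The sketch below is illustrative only: the function name `dns_matches_private_ip` and the `resolver` parameter are hypothetical, and the host name and addresses in the usage note are placeholders, not values confirmed in this thread.

```python
import socket


def dns_matches_private_ip(fqdn, private_ip, resolver=socket.gethostbyname):
    """Return (matches, resolved_ip) for the given FQDN and expected private IP.

    resolved_ip is None when the name cannot be resolved at all.
    """
    try:
        resolved_ip = resolver(fqdn)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False, None
    return resolved_ip == private_ip, resolved_ip
```

If, for example, `dns_matches_private_ip("cdsw.example.com", "10.10.10.110")` returns `(False, "<some other address>")`, the DNS answer differs from the private IP, and copying it into MASTER_IP would reproduce the problem.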