Created 09-02-2020 11:20 AM
@Marek I think it's definitely a network issue now.
Node IP: "Public-IP-Address" not found in the host's network interfaces
This message would indicate to me that the IP address of the host machine has changed, or at least that the above IP is not present at the network-interface level on this host.
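As a quick sanity check (just a sketch; run it on the affected host), you can list what is actually configured at the interface level and confirm whether the reported node IP is there:
# Sketch only: list the IPv4 addresses configured on this host's interfaces
ip -4 addr show
# the node IP from the error message should appear in this output; if it does not,
# kubelet cannot register the node under that address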
The issue is discussed in this thread: https://github.com/kubernetes/kubernetes/issues/54337
The architecture you are using is not supported. You might be able to hack around it using the workaround discussed in the thread:
Using the --hostname-override=external-ip argument for kubelet
but that is not a long-term solution. What I personally recommend is revising the network architecture, as CDSW is a little sensitive about this.
Created on 09-02-2020 11:54 PM - edited 09-03-2020 04:17 AM
@GangWar I have changed the kubelet parameter in /opt/cloudera/parcels/CDSW/scripts/start-kubelet-master-standalone-core.sh as suggested:
#kubelet_opts+=(--hostname-override=${master_hostname_lower})
kubelet_opts+=(--hostname-override=external-ip)
Unfortunately the pods (kube-apiserver, kube-scheduler, etcd) keep crashing/exiting.
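As a rough sketch of how one can inspect the crashing components at this point (the name filters below are assumptions and may differ per CDSW release):
# Illustrative only: check whether the core containers are still up under Docker
docker ps -a | grep -E 'kube-apiserver|kube-scheduler|etcd'
# then look at why a given container exited (replace <container-id> with one from the list above)
docker logs --tail 50 <container-id>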
Created 09-03-2020 12:33 AM
I do not see any successful host registrations. Please see the tail of the process logs below.
[root@cdsw-master-01 ~]# tail -n 10 /var/run/cloudera-scm-agent/process/19{09..11}*/logs/stderr.log
==> /var/run/cloudera-scm-agent/process/1909-cdsw-CDSW_DOCKER/logs/stderr.log <==
time="2020-09-03T07:00:55.018288801Z" level=error msg="Handler for GET /containers/12437b8b7b3b452bc7bfe8a3a26fe253de38601b7dd5093bd3d67a8f52b50e6b/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2020-09-03 07:00:55.018357 I | http: multiple response.WriteHeader calls
time="2020-09-03T07:01:12.350659606Z" level=info msg="stopping containerd after receiving terminated"
time="2020-09-03T07:01:12.351645251Z" level=info msg="Processing signal 'terminated'"
time="2020-09-03T07:01:12.352045287Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
time="2020-09-03T07:01:13.187239486Z" level=info msg="libcontainerd: new containerd process, pid: 9176"
time="2020-09-03T07:01:13.206461276Z" level=error msg="containerd: notify OOM events" error="open /proc/8671/cgroup: no such file or directory"
time="2020-09-03T07:01:13.206730882Z" level=error msg="containerd: notify OOM events" error="open /proc/8808/cgroup: no such file or directory"
time="2020-09-03T07:01:13.206985589Z" level=error msg="containerd: notify OOM events" error="open /proc/8995/cgroup: no such file or directory"
time="2020-09-03T07:01:13.904988075Z" level=info msg="stopping containerd after receiving terminated"
==> /var/run/cloudera-scm-agent/process/1910-cdsw-CDSW_MASTER/logs/stderr.log <==
E0903 07:00:54.262100 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.362293 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.462458 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.480206 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.133.210.200:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dexternal-ip&limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.480889 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://10.133.210.200:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.481951 31064 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://10.133.210.200:6443/api/v1/nodes?fieldSelector=metadata.name%3Dexternal-ip&limit=500&resourceVersion=0: dial tcp 10.133.210.200:6443: connect: connection refused
E0903 07:00:54.562631 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.662826 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.763006 31064 kubelet.go:2266] node "external-ip" not found
E0903 07:00:54.863203 31064 kubelet.go:2266] node "external-ip" not found
==> /var/run/cloudera-scm-agent/process/1911-cdsw-CDSW_APPLICATION/logs/stderr.log <==
func(*targs, **kargs)
File "/opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/cdsw_admin/cdsw/admin.py", line 63, in stop
os.killpg(os.getpid(), signal.SIGKILL)
OSError: [Errno 3] No such process
+ is_kubelet_process_up
+ is_kube_cluster_configured
+ '[' -e /etc/kubernetes/admin.conf ']'
+ return 0
++ KUBECONFIG=/etc/kubernetes/kubelet.conf
++ /opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/kubernetes/bin/kubectl get nodes
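The same registration check the start script performs can also be run by hand, using the kubeconfig and kubectl binary from the trace above:
export KUBECONFIG=/etc/kubernetes/kubelet.conf
/opt/cloudera/parcels/CDSW-1.7.2.p1.2066404/kubernetes/bin/kubectl get nodes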
Created on 09-03-2020 06:22 AM - edited 09-03-2020 06:23 AM
@GangWar The problem with crashing/exiting pods is now fixed. After the CDSW master host restoration, I had by mistake provisioned its MASTER_IP in the CM config as the address resolved by DNS from the CDSW FQDN; however, it should be the host's private IP address within the Cloudera cluster. Hence the intermediate problem is solved.
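In case it helps anyone else, a rough way to spot that mismatch (the FQDN below is a placeholder; substitute your own CDSW domain):
# address that DNS returns for the CDSW FQDN (placeholder name)
getent hosts cdsw.example.com
# addresses actually configured on the master host; MASTER_IP in CM should be
# the private address from this list, not the DNS-resolved public one
hostname -I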
Let me then kindly ask for further assistance in troubleshooting the original issue with HDFS access from CDSW sessions.