Support Questions


Unable to access HDFS from CDSW session

avatar
Explorer

Hi,

I would appreciate any advice on how to solve the following problem: in a CDH 6.3.2 HA-enabled cluster, I am unable to access HDFS from a CDSW CLI session:

 

!hdfs dfs -ls /
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:37","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode43. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:37","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode57. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:38","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 1 failover attempts. Trying to failover after sleeping for 813ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:38","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 2 failover attempts. Trying to failover after sleeping for 1903ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:40","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 3 failover attempts. Trying to failover after sleeping for 2225ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:43","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 4 failover attempts. Trying to failover after sleeping for 9688ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:08:52","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 5 failover attempts. Trying to failover after sleeping for 9501ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:02","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 6 failover attempts. Trying to failover after sleeping for 9001ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:11","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 7 failover attempts. Trying to failover after sleeping for 13904ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:25","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 8 failover attempts. Trying to failover after sleeping for 14567ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:39","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 9 failover attempts. Trying to failover after sleeping for 15279ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:09:55","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 10 failover attempts. Trying to failover after sleeping for 10985ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:05","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 11 failover attempts. Trying to failover after sleeping for 8394ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:14","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 12 failover attempts. Trying to failover after sleeping for 21701ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:36","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-03.novalocal:8020 after 13 failover attempts. Trying to failover after sleeping for 16983ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/06/24 13:10:53","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "cdh-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over cdh-control-02.novalocal:8020 after 14 failover attempts. Trying to failover after sleeping for 8437ms."}}
ls: Invalid host name: local host is: (unknown); destination host is: "cdh-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost

 

The contents of the /etc/hosts file on the CDH and CDSW nodes are:

 

# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.10.112 cdh-control-02.novalocal
10.10.10.111 cdh-control-01.novalocal
10.10.10.131 cdh-worker-01.novalocal
10.10.10.132 cdh-worker-02.novalocal
10.10.10.122 cdh-edge-02.novalocal
10.10.10.113 cdh-control-03.novalocal
10.10.10.121 cdh-edge-01.novalocal
10.10.10.133 cdh-worker-03.novalocal
10.10.10.110 cdsw-master-01.novalocal
10.10.10.130 cdsw-worker-01.novalocal

 

 

24 Replies

avatar

@Marek CDSW does not use the /etc/hosts file. You must meet all of the network requirements listed below:

https://docs.cloudera.com/documentation/data-science-workbench/1-6-x/topics/cdsw_requirements_suppor...

 

Most importantly, forward/reverse DNS lookups must work, and the wildcard DNS domain must resolve from within a session. The issue seems to be with the wildcard DNS, since the hostname is not resolving from the session.
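For illustration, a quick way to check this from a session's input prompt might be the following (the NameNode FQDNs are taken from the error log above; substitute your own CDSW domain for the placeholder):

!nslookup cdh-control-02.novalocal      # NameNode FQDNs from the failover messages above
!nslookup cdh-control-03.novalocal
!nslookup anything.cdsw.<your-domain>   # any label under the wildcard should resolve to the CDSW master IP

If the NameNode FQDNs do not resolve from inside the session, the HDFS client cannot reach either NameNode no matter what hdfs-site.xml contains.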


Cheers!

avatar
Explorer

I do confirm that the CDSW hosts meet all the networking requirements, in particular:

  • IPv6 is enabled
  • CDSW hosts are within the same subnet as the CDH cluster
  • DNS is configured with the relevant A record for domain name, CNAME record for wildcard domain, and a reverse PTR domain record
  • No iptables rules were enabled
  • SELinux is disabled

Let me also clarify: I can launch a session; however, within a session I am unable to access HDFS, either from the input prompt (as in my first post) or from any script.

Example DNS lookup commands from a session's input prompt:

 

 

 

!nslookup *.cdsw.<intranetdomain>
Server:		100.77.0.10
Address:	100.77.0.10#53

Non-authoritative answer:
*.cdsw.<intranetdomain>	canonical name = cdsw.<intranetdomain>.
Name:	cdsw.<intranetdomain>
Address: 10.133.210.200

!dig -x 10.133.210.200
; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> -x 10.133.210.200
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60863
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;200.210.133.10.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
200.210.133.10.in-addr.arpa. 300 IN	PTR	cdsw.<intranetdomain>.

;; Query time: 307 msec
;; SERVER: 100.77.0.10#53(100.77.0.10)
;; WHEN: Thu Jun 25 08:05:22 UTC 2020
;; MSG SIZE  rcvd: 93

 

 

 

I have also noticed that I am unable to access a terminal: the web browser returns HTTP ERROR 401, even though DNS resolves the terminal's FQDN to the CDSW master node's IP.

[screenshot: CDSW_terminal_1.png]

 

[cloud-user@cdh-control-01 ~]$ ping -c1 tty-jidv65sd8630btx4.cdsw.<intranetdomain>
PING cdsw.<intranetdomain> (10.133.210.200) 56(84) bytes of data.
64 bytes from cdsw.<intranetdomain> (10.133.210.200): icmp_seq=1 ttl=60 time=0.884 ms

--- cdsw.<intranetdomain> ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.884/0.884/0.884/0.000 ms

 

 

 

 

avatar
Explorer

A kind reminder about this open support question.

avatar

@Marek CDSW doesn't honour the /etc/hosts file, so that's not the issue. Can you confirm that localhost is resolving to 127.0.0.0? If it is, then please share the CDSW logs bundle and I will check once more.

cdsw logs -x

If you have a support subscription, feel free to file a case with us; we will be more than happy to assist you.


Cheers!

avatar
Explorer

@GangWar

I confirm that localhost resolves to 127.0.0.1, not to 127.0.0.0 (which I believe is a typo?):

[root@cdsw-master-01 ~]# nslookup localhost
Server:         172.16.1.3
Address:        172.16.1.3#53

Non-authoritative answer:
Name:   localhost
Address: 127.0.0.1

This is related to a CDSW proof-of-concept/trial on top of a CDH Enterprise R&D cluster, hence I am unable to submit a support case, though I would be glad to. Please check your private messages regarding the logs bundle.

avatar

@Marek Is hdfs dfs -ls working from the CDSW node itself?

 

The error below means there is some issue with the client configuration file.

Namenode for namenodeHA remains unresolved for ID namenode43. Check your hdfs-site.xml file to ensure namenodes are configured properly.

Can you check whether you have gateway roles installed, and whether you are able to list files from the CDSW master node?

 

I would suggest performing the following:

 

  1. Deploy client configurations for all HDFS roles again.
  2. Restart the NameNodes.
  3. Check that the Gateway role is available on the CDSW hosts.
  4. Form CDSW host doc a list on HDFS. 

Cheers!

avatar
Explorer

@GangWar I do confirm that I am able to list the HDFS files from the CDSW master node:

 

[root@cdsw-master-01 ~]# hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase               0 2020-06-29 19:23 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2020-06-29 21:05 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2020-06-29 21:44 /user

 

Have re-deployed client configurations and refreshed the cluster.

Have restarted NN roles.

I confirm that the HDFS gateway roles are available on the CDSW hosts:

[screenshot: CDSW_HDFS_access_error.png]

Please clarify what you mean by "Form CDSW host doc a list on HDFS".

From a CDSW session input prompt I try to access HDFS; however, I still get the error:

 

!hdfs dfs -ls /
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/07/02 09:08:35","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode43. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/07/02 09:08:35","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode57. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}

 

Hence, I would appreciate your further assistance with the troubleshooting.

avatar
Explorer

A kind reminder about this open support question.

avatar
Explorer

Let me refresh this thread and kindly remind you about this open support question.

avatar

@Marek The only thing I can think of from the logs is an issue with the client configuration.

What does the hdfs-site.xml say? Can we have a copy?
What happens if you run /opt/cloudera/parcels/CDH/bin/hdfs dfs -ls / from a session?
Which engine version is running? Some engines did not have the HDFS client installed.
Run the commands below from a session and provide the output:
echo $PATH
which hdfs

I am wondering whether the path /opt/cloudera/parcels/CDH/bin is missing from the default paths set in the session; if so, you would have to export /opt/cloudera/parcels/CDH/bin into the PATH manually and run the HDFS listing again to see whether that works.
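A minimal sketch of that check and workaround from the session prompt (hypothetical; the second line is only needed if the parcel directory is missing from $PATH):

!echo $PATH | tr ':' '\n' | grep cloudera                              # is the CDH parcel bin directory on the PATH?
!export PATH=$PATH:/opt/cloudera/parcels/CDH/bin && hdfs dfs -ls /     # chain export and command, since each ! line typically runs in its own shell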


Cheers!

avatar
Explorer

@GangWar 

Please see the CDSW session command log and the hdfs-site.xml file contents enclosed.

 

 

!echo $PATH
/usr/lib/jvm/jre-openjdk/bin:/home/cdsw/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/conda/bin:/opt/cloudera/parcels/CDH/bin:/home/cdsw/.conda/envs/python3.6/bin
!which hdfs
/opt/cloudera/parcels/CDH/bin/hdfs
!/opt/cloudera/parcels/CDH/bin/hdfs dfs -ls /
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/07/28 11:28:47","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode43. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"WARN","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/07/28 11:28:47","logger":"hdfs.DFSUtilClient","timezone":"UTC","log":{"message":"Namenode for namenodeHA remains unresolved for ID namenode57. Check your hdfs-site.xml file to ensure namenodes are configured properly."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/07/28 11:28:47","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "blc-control-03.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over blc-control-03.novalocal:8020 after 1 failover attempts. Trying to failover after sleeping for 1424ms."}}
{"type":"log","host":"host_name","category":"HDFS-hdfs-GATEWAY-BASE","level":"INFO","system":"etcd_clcm_std_3C_2E_3W_cdh","time": "20/07/28 11:28:49","logger":"retry.RetryInvocationHandler","timezone":"UTC","log":{"message":"java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "blc-control-02.novalocal":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over blc-control-02.novalocal:8020 after 2 failover attempts. Trying to failover after sleeping for 2662ms."}}
<?xml version="1.0" encoding="UTF-8"?>

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>namenodeHA</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.namenodeHA</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.namenodeHA</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>blc-control-01.novalocal:2181,blc-control-02.novalocal:2181,blc-control-03.novalocal:2181</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.namenodeHA</name>
    <value>namenode43,namenode57</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.namenodeHA.namenode43</name>
    <value>blc-control-02.novalocal:8020</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.namenodeHA.namenode43</name>
    <value>blc-control-02.novalocal:8022</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.namenodeHA.namenode43</name>
    <value>blc-control-02.novalocal:9870</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.namenodeHA.namenode43</name>
    <value>blc-control-02.novalocal:9871</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.namenodeHA.namenode57</name>
    <value>blc-control-03.novalocal:8020</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.namenodeHA.namenode57</name>
    <value>blc-control-03.novalocal:8022</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.namenodeHA.namenode57</name>
    <value>blc-control-03.novalocal:9870</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.namenodeHA.namenode57</name>
    <value>blc-control-03.novalocal:9871</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.permissions.umask-mode</name>
    <value>022</value>
  </property>
  <property>
    <name>dfs.client.block.write.locateFollowingBlock.retries</name>
    <value>7</value>
  </property>
  <property>
    <name>dfs.namenode.acls.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit.skip.checksum</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.client.domain.socket.data.traffic</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>ALWAYS</value>
  </property>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
    <value>true</value>
  </property>
</configuration>

 

 

 

avatar

@Marek I tried my best to dig into this issue for you. First of all, I want to make clear that this is not a CDSW issue; it is an HDFS client configuration issue. I found a discrepancy:

  1. In the hdfs-site.xml file, notice the dfs.ha.namenodes.[nameservice ID] section, where you have declared the NameNode IDs as "namenode43,namenode57"; this is what causes the issue indicated in the message below.

 

<property>
  <name>dfs.ha.namenodes.namenodeHA</name>
  <value>namenode43,namenode57</value>
</property>
"Namenode for namenodeHA remains unresolved for ID namenode57. Check your hdfs-site.xml file to ensure namenodes are configured properly."

 

So you should declare the IP or FQDN here, because the session cannot resolve these short names and there is no entry for them in the "/etc/hosts" file. Your "/etc/hosts" file consists only of entries in the following format:

---- RUN cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
IP address blc-control-02.novalocal
IP address blc-control-01.novalocal

👆 This format is not recommended; the network names should be configured as below:

1.1.1.1  foo-1.example.com  foo-1
2.2.2.2  foo-2.example.com  foo-2
3.3.3.3  foo-3.example.com  foo-3
4.4.4.4  foo-4.example.com  foo-4

Ideally, you should meet the conditions below:

  • Your CDSW session should resolve localhost as:
    !nslookup localhost
    Server:  IP address
    Address: IP address#53
    Name:    localhost
    Address: 127.0.0.1
    Name:    localhost
    Address: ::1
  • Your "/etc/hosts" file should be correct as per the recommended network names configuration.
  • Your hdfs-site.xml should use the FQDN or IP.

NOTE: Having short-name entries for "namenode43,namenode57" in the "/etc/hosts" file might work with your existing hdfs-site.xml, but I would recommend doing it the FQDN way to avoid other issues.
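For illustration only, a sketch of what the recommended /etc/hosts entries might look like for this cluster. The IP addresses are taken from the listing earlier in this thread, the short-name aliases are hypothetical, and the cdh-*/blc-* naming difference between the posts is assumed to be an anonymization artifact:

10.10.10.111  cdh-control-01.novalocal  cdh-control-01
10.10.10.112  cdh-control-02.novalocal  cdh-control-02
10.10.10.113  cdh-control-03.novalocal  cdh-control-03
10.10.10.110  cdsw-master-01.novalocal  cdsw-master-01
(the remaining worker/edge/CDSW hosts would follow the same IP  FQDN  shortname pattern)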

 

Please make these modifications and let me know how it goes; I am waiting for this issue to be resolved 🙂🤞


Cheers!

avatar
Explorer

@GangWar I am confused. Earlier you wrote that CDSW does not care about the /etc/hosts file, and now you write that the short names should be declared in the /etc/hosts file. Which statement is right?

 

Notwithstanding, if the CDSW hosts are managed by Cloudera Manager, shouldn't the latter take care of the relevant configuration of all the cluster hosts? In other words, if the CDH hosts in the cluster communicate correctly with the HDFS NameNodes based on the hdfs-site.xml config file, why don't the CDSW hosts?

 

Nevertheless, unfortunately the CDSW master host crashed and I was unable to restore it through Cloudera Manager. I tried to solve this by removing the CDSW service from the cluster, removing the CDSW host completely from the cluster, destroying and re-creating the VM for the CDSW master, redeploying the prerequisites on it, and adding it back to CM and the cluster. However, the problem now is with adding the CDSW service back to the cluster: the procedure gets stuck while running /opt/cloudera/parcels/CDSW/scripts/create-docker-thinpool.sh. It hangs at the command:

 

lvcreate --wipesignatures y -n thinpool docker -l 95%VG

 

The procedure to add the CDSW service continues and completes only if I manually terminate the hanging lvcreate process in the CLI (kill -2 <pid>). However, the Docker daemon service then seems to malfunction, as several service pods do not come up, including the CDSW web GUI.

 

[screenshots: CDSW_service_status.png, CDSW_service_status_error.png, CDSW_service_docker.png]

avatar

@Marek CDSW doesn't consult /etc/hosts for its internal communications, i.e. within the K8s and pod hierarchy. Here the issue is on the HDFS side, as stated in the previous comment.

Regarding the issue with the lvcreate command, you are hitting a known bug: https://access.redhat.com/solutions/1228673

You have to run the command manually from the terminal (not kill it) while starting only the Docker role; then start the Master and Application roles respectively and see if CDSW comes up.
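A minimal sketch of that manual step on the CDSW master host, assuming the volume group created by the setup script is named docker as in the script shown earlier:

lvcreate --wipesignatures y -n thinpool docker -l 95%VG    # the command that previously hung
lvs docker                                                 # verify the thinpool logical volume now exists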


Cheers!

avatar
Explorer

@GangWar Which command should I run manually from the terminal, on which cluster hosts, and at which point in the overall procedure of adding the CDSW service to the cluster?

 

Nonetheless, I have removed the CDSW roles and host from the cluster and Cloudera Manager, created another clean VM, adjusted its configuration to meet the requirements, and added the CDSW service and its roles back on the new host. Unfortunately, the CDSW service reports the same errors as before and the web GUI is not accessible. The docker-thinpool logical volume has been created successfully; however, the containers keep crashing/exiting:

[screenshot: CDSW_service_lvdisplay_docker.png]

avatar

@Marek Try to start CDSW in the manner below and then check the logs for any failed pods.

1. Go to CM > CDSW > Stop (to stop CDSW first).
2. Go to CM > CDSW > Actions > run Prepare Node.
3. Go to CM > CDSW > Instances:
   i) Select only the Docker role on the Master host and start it.
   ii) Select the Master role on the Master host and start it.
   iii) Then select the Application role on the Master host and start it.
   iv) Select the Docker role on the Worker host and start it.
   v) Finally, select all Worker roles on the CDSW hosts and start them.

Cheers!

avatar
Explorer

@GangWar I followed the steps jointly with a Cloudera representative (Kamel D). Unfortunately the problem is still there: several containers keep exiting.

avatar
Explorer

I have performed some further troubleshooting. According to the CDSW master-docker process stderr.log, there might be a problem with Kubernetes DNS resolution due to missing weave containers for pod networking. Indeed, the DNS lookup cannot resolve one of the container repository FQDNs, docker-registry.infra.cloudera.com, which is supposed to hold the weave containers. [screenshot: CDSW_service_docker_errors.png]

Are you in a position to verify and confirm whether that is the root cause?

avatar

@Marek No, that's a false alarm; docker-registry shouldn't be accessible publicly, so that is expected. The more worrisome part is this:

I0831 17:16:26.129073 10863 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach
W0831 17:16:26.129340 10863 kubelet_node_status.go:481] Failed to set some node status fields: failed to validate nodeIP: Node IP: "Public-IP-Address" not found in the host's network interfaces
I0831 17:16:26.131194 10863 kubelet_node_status.go:447] Recording NodeHasSufficientMemory event message for node cdsw-master-01.novalocal
I0831 17:16:26.131229 10863 kubelet_node_status.go:447] Recording NodeHasNoDiskPressure event message for node cdsw-master-01.novalocal
I0831 17:16:26.131239 10863 kubelet_node_status.go:447] Recording NodeHasSufficientPID event message for node cdsw-master-01.novalocal
I0831 17:16:26.131255 10863 kubelet_node_status.go:72] Attempting to register node cdsw-master-01.novalocal
E0831 17:16:26.131750 10863 kubelet_node_status.go:94] Unable to register node "cdsw-master-01.novalocal" with API server: Post https://Public-IP-Address:6443/api/v1/nodes: dial tcp Public-IP-Address:6443: connect: connection refused
E0831 17:16:26.137202 10863 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://Public-IP-Address:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcdsw-master-01.novalocal&limit=500&resourceVersion=0: dial tcp Public-IP-Address:6443: connect: connection refused
E0831 17:16:26.138323 10863 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://Public-IP-Address:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp Public-IP-Address:6443: connect: connection refused
E0831 17:16:26.139410 10863 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://Public-IP-Address:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dcdsw-master-01.novalocal&limit=500&resourceVersion=0: dial tcp Public-IP-Address:6443: connect: connection refused
E0831 17:16:26.184663 10863 kubelet.go:2266] node "cdsw-master-01.novalocal" not found
E0831 17:16:26.284919 10863 kubelet.go:2266] node "cdsw-master-01.novalocal" not found
E0831 17:16:26.385174 10863 kubelet.go:2266] node "cdsw-master-01.novalocal" not found
E0831 17:16:26.485392 10863 kubelet.go:2266] node "cdsw-master-01.novalocal" not found

Whereas when I grep for "Successfully registered node" in my cluster, there are positive matches, but there are none in yours.

[DNroot@100.96 process]# rg "Successfully registered node"
2323-cdsw-CDSW_MASTER/logs/stderr.log
4919:I0824 07:47:48.639094 9038 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

2213-cdsw-CDSW_MASTER/logs/stderr.log
4733:I0729 04:19:04.480413 26592 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

2243-cdsw-CDSW_MASTER/logs/stderr.log.2
4629:I0729 04:25:25.114590 18483 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

2182-cdsw-CDSW_MASTER/logs/stderr.log
5140:I0728 10:02:24.590379 11039 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

2108-cdsw-CDSW_MASTER/logs/stderr.log
4761:I0707 03:34:20.010139 8360 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

2123-cdsw-CDSW_MASTER/logs/stderr.log.1
4907:I0708 10:39:33.202393 6672 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

2058-cdsw-CDSW_MASTER/logs/stderr.log
4747:I0703 04:18:25.271379 30286 kubelet_node_status.go:75] Successfully registered node host-10-17-xxx-xx

So again I am back to the network issue 🙂 


Cheers!

avatar

@Marek I think it's definitely a network issue now.

Node IP: "Public-IP-Address" not found in the host's network interfaces

This message indicates to me that the IP address of the host machine has changed, or at least that the IP above is not present on any of this host's network interfaces.
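A quick way to check this on the CDSW master host might be the following (substitute the public IP reported in the kubelet log):

ip addr show | grep '<public-ip>'    # the advertised node IP should appear on one of the host's interfaces
hostname -i                          # the IP address the host resolves its own name to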

This thread talks about the issue: https://github.com/kubernetes/kubernetes/issues/54337

The architecture which you are using is not supported; you might be able to hack things using the approach discussed in the thread:

Using --hostname-override=external-ip arguments for kubelet

but that is not a long-term solution. So revising the network architecture is what I personally recommend, as CDSW is a little sensitive about this.


Cheers!