04-25-2019 05:16 PM - last edited on 04-26-2019 06:04 AM by cjervis
Hoping someone can help.
I'm getting this error when impalad connects:
RPC error: Client for <hostname>:23000 hits an unexpected exception authorize: cannot authorize peer, type: N6apache6thrift9transport13TSSLException
I get a similar error when SSL is disabled.
The connect/disconnect cycle repeats about every half second, indefinitely.
The log does include a message that the cert was successfully loaded, so I don't think that's the problem, but I'm open to any suggestions.
Does the type noted above literally refer to a connection to an HBase Thrift server?
Anyone know more about this specific error message?
04-28-2019 03:35 AM
It looks like you have a firewall issue between your nodes; try checking iptables.
Please share the impalad log files with us.
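A quick way to rule the firewall in or out is to test raw TCP reachability of the Impala backend port before looking at SSL at all. The sketch below uses bash's built-in /dev/tcp redirection so no extra tools are needed; the host name node1 and port 23000 are placeholders taken from the error message, not values from this cluster:

```shell
#!/usr/bin/env bash
# Hypothetical reachability check for the impalad backend port.
# "node1" and 23000 are placeholders: substitute your own host/port.
check_port() {
  local host=$1 port=$2
  # bash opens a TCP socket when you redirect to /dev/tcp/<host>/<port>;
  # timeout bounds the wait so a silently dropping firewall doesn't hang us.
  if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed or filtered"   # firewall / iptables suspect
  fi
}

check_port node1 23000
```

If the port shows as open locally on the server but closed from a peer node, that points at iptables (or an intermediate firewall) rather than at TLS itself.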
04-30-2019 08:39 AM
PS - I have to go through some hoops to get the logs out. Hand-typing what I think is relevant takes some time. I'm hoping to have something soon.
As a note, we have SSL enabled.
04-30-2019 02:13 PM
Hi @bridgor , taking just the N6apache6thrift9transport13TSSLException originally posted (that mangled name demangles to apache::thrift::transport::TSSLException), this typically comes down to a hostname mismatch in the .pem files used, or to DNS problems. Check the statestore logs / web UI for issues as well. I agree with @AcharkiMed that this seems to be network / SSL related.
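One way to check for exactly this kind of mismatch is to inspect which hostname the server certificate actually carries and compare it against the name the daemons connect with. The sketch below generates a throwaway self-signed cert purely to demonstrate the inspection commands; node1.example.com is a placeholder, and the -addext flag assumes OpenSSL 1.1.1 or newer:

```shell
# Generate a throwaway self-signed cert for a placeholder FQDN
# (illustration only; inspect your real .pem files the same way).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/node1.key -out /tmp/node1.pem \
  -subj "/CN=node1.example.com" \
  -addext "subjectAltName=DNS:node1.example.com"

# The subject CN and the subjectAltName must match the hostname the
# peer connects with, or the TLS handshake fails with TSSLException.
openssl x509 -noout -subject -ext subjectAltName -in /tmp/node1.pem
```

You can also probe the live endpoint with `openssl s_client -connect <hostname>:23000 -showcerts` to see which certificate is actually being presented during the handshake.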
Robert Justice, Technical Resolution Manager
05-02-2019 08:54 AM
Hi @bridgor ,
You may want to check out this thread: https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-Catalogue-server-down-after-upg...
I don't know which Impala version you are on (I may have missed it), so I can't say whether the thread is directly applicable to your problem, but it looks remarkably similar.
05-02-2019 10:03 AM
Copy, thanks. It is close. We are now using FQDNs and the errors are a little different now. So here is an additional question:
DNS is not running here, so when we use the FQDN, is any part of CDH attempting to resolve the FQDN using DNS? Or is the FQDN used for some other purpose?
Our install was generally working better when we were using only short hostnames. The FQDNs seem to have complicated the issue - we made up a generic but consistent domain. Is Cloudera looking for the domain and having problems when it can't find it?
Hoping to understand how the FQDN is actually being used, as this represents a shift in the original problem statement.
05-03-2019 02:25 PM
I would say this depends on whether your hostname was set to the FQDN at the time of the CM/cluster install, whether the OS hostname (as reported by the hostname command) is fully qualified, and, if you are still using TLS/SSL, whether your certificates carry the FQDN hostname. Kerberos and several other components are very sensitive to DNS. Run the Host Inspector under CM to check for DNS / host-resolution problems.
Since you state you are not using DNS, I would suggest making sure the /etc/hosts file on every host contains all hosts of the cluster plus CM, with the fully qualified hostname listed first, then aliased to the short hostname. You can use rsync to keep this file consistent across the cluster. Also make sure /etc/nsswitch.conf has files first on the hosts: line, so /etc/hosts gets consulted first. Finally, if you suspect the internal hostname was changed away from the FQDN to the short name, either change it back or follow the following article to get the CM configuration back in sync with what it was at install time (check the Name column beneath CM's Hosts tab to see what CM has in its database):
[root@cm1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.3 cm1.rgjustice.com cm1
192.168.2.4 node1.rgjustice.com node1
192.168.2.5 node2.rgjustice.com node2
[root@cm1 ~]# cat /etc/nsswitch.conf |grep hosts
#hosts: db files nisplus nis dns
hosts: files dns myhostname
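With the /etc/hosts layout and nsswitch ordering above in place, each node can be sanity-checked with getent, which resolves names through the same NSS path the daemons use (unlike nslookup or dig, which query DNS only). A minimal sketch, reusing the placeholder host names from the example above:

```shell
# Verify resolution goes through NSS the same way the services will.
# getent honors /etc/nsswitch.conf; nslookup/dig bypass /etc/hosts entirely.
for h in cm1.rgjustice.com node1.rgjustice.com node2.rgjustice.com; do
  getent hosts "$h" || echo "UNRESOLVED: $h"
done

# Each node should also report its own fully qualified name:
hostname -f
```

If getent returns nothing, or returns the short name where the FQDN is expected, fix the /etc/hosts ordering on that host before revisiting the SSL side.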
Robert Justice, Technical Resolution Manager
05-06-2019 03:27 PM
Thanks to everyone who replied. It turns out that references to truststores, server keys, etc., and the associated passwords may be cached, so when we changed these after moving the cluster, creating new certs and replacing the passwords in CDH was insufficient.
So, after DELETING all fields containing passwords, cert locations, key locations, etc., unchecking SSL, restarting the cluster, and adding the references back in, everything works. Uugghhh - who knew! :)