Hi @pearlraj36, I guess it just boils down to installing Presto and configuring it following the official docs: https://prestodb.io/docs/current/installation/deployment.html Then you need all the CDH configuration files to be available to Presto; my best bet would be setting up the Presto service on either your worker nodes or edge nodes, so you don't have to worry about manually managing changes to the config files. I'll be trying this soon for a customer, so if anything different comes up I'll update this post. Cheers, Matteo
Hello Everyone, posting an emergency recovery guide I created and have been using myself for the situation in the title (I tried to add it to https://community.cloudera.com/t5/Support-Questions/Namenode-Txid-Error/td-p/99549 but it won't give me the option to add a reply).

Disclaimer: this is not intended as an alternative to restoring from a backup of your NNs and JNs. Always go for that before trying this; this is meant as an emergency last try before formatting HDFS and losing all the data. You might experience some data loss, but losing some is still better than losing all, in my opinion. I tested it twice in some pretty scrambled-up environments (in my case, we went from a working Cloudera Manager 6 with CDH 6.3.2 and SSL + Kerberos + Sentry configured to a CM 6 without any software installed, SSL partially disabled, and the Kerberos and module configs reverted to 7 months earlier, probably due to some internal hijack).

As an assumption, you need to have had HDFS HA enabled at the moment of the crash: even if your NNs are not working, you can be almost sure that at least one of your JNs is healthy. Also, when following this procedure, always back up the files/dirs you're editing; you might regret it later if you didn't.

The error I got when trying to restart the NN is this:

We expected txid 2104209346, but got txid 2104208991.

The logs also showed that the NNs couldn't communicate and get in sync with the JNs.

The first thing to inspect is which JN has the most recent edits saved on disk: you can find it by searching for the edit file containing the txid reported in the error above. When you find the JN with those edits, back up the whole "current" directory of the other JNs which don't have them, then copy the whole "current" directory of the healthy JN to the other JNs you have configured.
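To locate the JN holding the needed transactions, something along these lines can help (a sketch only: the edits path and the txid are assumptions taken from my error message, so adjust both to your environment). Finalized edit segments are named edits_&lt;first_txid&gt;-&lt;last_txid&gt;, so you can check whether a segment on a given JN covers the missing txid:

```shell
#!/bin/bash
# Print the finalized edits segment in DIR whose txid range covers TXID,
# if any. Segment file names encode the range: edits_<first>-<last>.
find_segment_for_txid() {
  local dir=$1 txid=$2 name range start end f
  for f in "$dir"/edits_*-*; do
    [ -e "$f" ] || continue
    name=${f##*/}; range=${name#edits_}
    start=$((10#${range%-*}))   # strip leading zeros, compare numerically
    end=$((10#${range#*-}))
    if [ "$txid" -ge "$start" ] && [ "$txid" -le "$end" ]; then
      echo "$name"
    fi
  done
}

# Example (placeholder path and the txid from my error -- run on each JN):
find_segment_for_txid /dfs/jn/mycluster/current 2104209346
```

Run it on every JN host; the one that prints a segment name is the candidate to copy from.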
Beware of the permissions/owners of the files. Also check the VERSION file and the seen_txid/committed txid of both the NN and JN instances: they must be in sync. You can also edit the VERSION/txid files yourself, but always save a backup copy before editing anything. After that, you can try starting the JNs and one NameNode.

If you chose a JN which doesn't have the required edit logs, after starting the JNs and the NameNode you should get an error like "there appears to be a gap in the edit files" at some txid X. This means the NN can't match the VERSION file, txid, and edit logs of the JNs. You might also try copying every JN directory to every other JN in round-robin fashion, to see which one is the healthy one.

If done correctly, after a while spent re-creating the fsimage by reading the edit logs, you can start all DataNode instances, and the NN should come up properly active and healthy, so your HDFS should be available again with all data accessible. After confirming everything is OK, you can start the other NameNode.

In my case, the other NN didn't start, with this error:

Cannot skip to less than the current value (=1414885646), where newValue=1414885603

I solved it by copying the whole "current" directory of the healthy NN to the other, non-healthy NN and restarting it.

Again, this is not something I've found officially documented, and I'm sharing it only for the sake of "trying everything you can before declaring data loss and being forced to format HDFS". I suggest searching for documents about HDFS and always going for backups, which you should take regularly. I've followed this procedure on both virtual (Docker) and on-premise environments. Hope this helps someone. Have a great day everyone
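The seen_txid check mentioned above can be scripted too. This is just a sketch under my assumptions (the example paths in the comment are placeholders for your NN/JN data directories): it compares the txid files you pass it and flags the first mismatch.

```shell
#!/bin/bash
# Compare the txid recorded in each file passed in; report the first mismatch.
same_txid() {
  local first="" f t
  for f in "$@"; do
    t=$(tr -d '[:space:]' < "$f")          # each file holds a single number
    if [ -z "$first" ]; then first=$t; fi
    if [ "$t" != "$first" ]; then
      echo "mismatch: $f has $t (expected $first)"
      return 1
    fi
  done
  echo "all in sync at txid $first"
}

# Example (placeholder paths -- substitute your NN/JN data directories):
# same_txid /dfs/nn/current/seen_txid /dfs/jn/mycluster/current/committed-txid
```

Collect the txid files from every NN and JN host (e.g. over scp) before running it, and remember to back them up before changing anything.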
Need some Kerberos/Hbase/Hue guru here.
We planned an upgrade from a working environment in CDH 5.14.0 with full stack security (TLS/Kerberos/Sentry) to CDH 6.3.2 due to HUE-8344
Since we were also using Spark 2.4 as a CSD and CDK 4.1, the perfect match was the latest 6.3 version
All good following the documentation, but just before finalizing the HDFS metadata (which should be one of the last things to do, after confirming everything is OK), we tried the HBase app inside Hue, and the app isn't working at all
The only error shown in the Hue HBase app is "Api Error: Authentication Error"
I've read a few posts on this subject and tried a whole range of different things to fix it. Apparently, either A) the HBase Thrift server isn't starting properly with Kerberos auth, or B) the Hue plugin isn't working as intended with the Thrift server. I'm attaching to this very post the DEBUG log of the Thrift server while trying to reach it from Hue
Linking for reference every TN/thread I've followed while trying to get out of this situation (beware: of course, before trying the debug route, I reviewed every official doc from the Cloudera and Hue websites)
https://stackoverflow.com/questions/31152568/hue-hbase-api-error-none
https://community.cloudera.com/t5/Support-Questions/Hue-HBase-error-Api-Error-The-kerberos-principal-name-is/td-p/63545
https://community.cloudera.com/t5/Support-Questions/I-am-getting-an-api-error-when-accessing-hbase-browser/td-p/37188
https://community.cloudera.com/t5/Support-Questions/Hue-hbase-Api-Error-TSocket-read-0-bytes/td-p/21070 (this is an old post, so it doesn't quite match my case)
https://github.com/cloudera/hue/issues/702

I've tried lots of things based on what I found on the community, like disabling impersonation and changing the HBase Thrift security to "none", "auth", and "auth-int". I've also tried adding Kerberos algorithms based on another thread I found here, and I've even directly edited the HBase sh scripts (for the sake of debugging, of course) to force static pointers to all the Kerberos-related files (for reference: the jaas.conf, the Kerberos principal, and the keytab to use), also using safety valves as needed
At the moment I've rolled back every setting to what it was before the upgrade (and working); attaching some of the main settings
What is really confusing me are these lines found in the Thrift server log:
2019-12-04 15:45:54,416 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:hbase (auth:SIMPLE) from:org.apache.hadoop.hbase.thrift.ThriftHttpServlet.doKerberosAuth(ThriftHttpServlet.java:162)
2019-12-04 15:45:54,416 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedActionException as:hbase (auth:SIMPLE) cause:org.apache.hadoop.hbase.thrift.HttpAuthenticationException: Kerberos authentication failed:
2019-12-04 15:45:54,416 INFO org.apache.hadoop.hbase.thrift.ThriftHttpServlet: Failed to authenticate with hbase kerberos principal
Of course, both the Thrift and REST HBase security settings are set to "Kerberos" and not "Simple"
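For reference, my understanding from this debugging (an assumption on my side, not something I've confirmed fixed it) is that with Kerberos the Hue HBase app expects the Thrift server in http mode with proxy-user (doAs) support. A sketch of the relevant hbase-site.xml properties on the Thrift server role, which CM normally manages through checkboxes/safety valves rather than hand edits:

```xml
<!-- Thrift server role, hbase-site.xml: run Thrift over HTTP so Hue can
     authenticate via SPNEGO and impersonate the logged-in user -->
<property>
  <name>hbase.regionserver.thrift.http</name>
  <value>true</value>
</property>
<property>
  <name>hbase.thrift.support.proxyuser</name>
  <value>true</value>
</property>
```

If anyone can confirm whether these two are required on CDH 6.3.2, that would already help narrow things down.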
Sharing the rest of the log in this post
Hope you have a good hint for me, since we did the upgrade on purpose to use the HBase app inside Hue, and having it completely broken is very disappointing
Thank you for the attention and have a nice day,
Hello @lwang, many thanks for your answer. I've just run the command # pidof chronyd on my machine, and chronyd is not installed on the machines, since no PID was returned. If I understand correctly, this is a tool used to sync the operating system clock to the NTP server clock. Do you think it would be a good idea to install it as a permanent fix for the NTP clock issue? Or maybe restarting the ntpd service? Today I still received the alert, but after 20-30 minutes everything spontaneously returned green, so it's not a serious blocking problem. Many thanks in advance; I'm available for any clarification. Regards, Teolux
I have a test Cloudera cluster composed of three servers (one with Cloudera Manager and two data nodes). Sometimes, but no more than twice a day, an alert arrives about a bad clock offset for a node. Once I log in to the server from the alert, I check the NTP service with the command service ntpd status, and it seems to be OK, giving me this output telling me it's running:
Oct 30 12:46:25 mbesrvcdrmw1.mbeitaly.mbe.lan ntpd: 0.0.0.0 0615 05 clock_sync
Oct 30 12:46:26 mbesrvcdrmw1.mbeitaly.mbe.lan ntpd: 0.0.0.0 c618 08 no_sys_peer
I have tried executing the command ntpq -np on all three servers and, on every one, there is a "*" at the beginning of one server line in the output, showing that the sync is active.
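In case it helps reproduce the check, the offset ntpd reports for the selected peer (the line marked "*") can be pulled out of that same output. A small sketch, where the field position is an assumption based on the standard ntpq column layout (offset is the next-to-last column, in milliseconds):

```shell
#!/bin/bash
# Print the offset (ms) of the peer ntpd is currently synced to: the "*" line.
selected_peer_offset() {
  awk '$1 ~ /^\*/ { print $(NF-1) }'
}

# Only run the live query where ntpq is actually installed.
if command -v ntpq >/dev/null; then
  ntpq -pn | selected_peer_offset
fi
```

Comparing that number against the Cloudera Manager clock-offset thresholds could tell whether ntpd and the CM agent disagree during those 15-20 minutes.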
I have also tried restarting the NTP service with service ntpd restart and it seems to work, but sometimes the error reappears and then, after 15-20 minutes, spontaneously fixes itself.
You can find the full output of these two commands in the attachment.
Do you have any ideas about how to fix this matter permanently?
Many thanks in advance