We've recently been seeing some weird behavior from our cluster.
Things will work well for a day or two, and then HiveServer2 and several region servers will go offline.
When I dig into the logs, they all reference ZooKeeper:
2019-05-24 20:12:15,108 ERROR nodes.PersistentEphemeralNode (PersistentEphemeralNode.java:deleteNode(323)) - Deleting node: /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:239)
    at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:234)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
    at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
    at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:215)
    at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:42)
    at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.deleteNode(PersistentEphemeralNode.java:315)
    at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.close(PersistentEphemeralNode.java:274)
    at org.apache.hive.service.server.HiveServer2$DeRegisterWatcher.process(HiveServer2.java:334)
    at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:61)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2019-05-24 20:12:15,110 ERROR server.HiveServer2 (HiveServer2.java:process(338)) - Failed to close the persistent ephemeral znode
However, when I look in the ZooKeeper logs, I don't see anything.
If I restart the failed services, they will run for several hours, and then the process repeats.
We haven't changed any settings on the cluster, but two things have changed recently:
1 - A couple of weeks ago, some IT folks accidentally changed the /etc/hosts files on our nodes. We fixed this and restarted everything on the cluster.
2 - Those changes in (1) were part of some major network changes, and we now seem to have a lot more latency.
With all of that said, I really need some help figuring this out.
Could it be stale HBase WAL files somewhere? Could those cause HiveServer2 to fail?
Is there a ZooKeeper timeout setting I can change to help?
Any tips would be much appreciated.
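For reference, these are the settings I've been looking at raising; this is only a sketch based on what I can find in the docs, so the property names and defaults may need checking against our HDP version:

```xml
<!-- hive-site.xml: HiveServer2's ZooKeeper client session timeout, in ms
     (assumption: this property is honored in our Hive version) -->
<property>
  <name>hive.zookeeper.session.timeout</name>
  <value>1200000</value>
</property>

<!-- hbase-site.xml: region server ZooKeeper session timeout (default 90000 ms) -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```

My understanding is that the ZooKeeper server clamps whatever the client asks for between minSessionTimeout and maxSessionTimeout in zoo.cfg (by default 2x and 20x tickTime), so raising the client side alone may not be enough.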
Once the hostname has been changed, besides updating the Ambari-related configuration, please strictly follow the HWX document on changing host names.
You will need to check the below properties when updating ambari.properties:
server.jdbc.hostname=
server.jdbc.rca.url=
server.jdbc.url=
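For example (the hostname here is hypothetical, and I am assuming the default PostgreSQL backend on port 5432; your JDBC URL format will differ for MySQL or Oracle), the updated entries might look like:

```properties
server.jdbc.hostname=new-db-host.example.com
server.jdbc.rca.url=jdbc:postgresql://new-db-host.example.com:5432/ambari
server.jdbc.url=jdbc:postgresql://new-db-host.example.com:5432/ambari
```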
For Hive, Oozie, and Ranger, you will need to update these values if the host where the databases for those components live was also changed.
As the root user, connect to your database backend and execute:
grant all privileges on hive.* to 'hive'@'new_host' identified by 'hive_password';
grant all privileges on hive.* to 'hive'@'new_host' with grant option;
Run the equivalent grants for each affected service mentioned above.
You could also be having an HBase-to-Hive schema mapping issue! Please revert with your findings.
Thanks @Geoffrey Shelton Okot
Just to clarify, we corrected all the hosts files and restarted all the services.
I have a hunch that there is some HBase data somewhere that is now corrupt because it is associated with the incorrect FQDN.
But I wouldn't expect Hive to have any relationship to HBase.
Does ZooKeeper use HBase for record keeping?
It took me a while to look in /var/log/messages, but I found a ton of ntpd errors.
It turns out that our nodes were having trouble reaching the servers they were configured to use for time sync.
I switched all the configurations to use a local on-premises server and restarted everything.
I'm hoping that will be the full solution to our issue.
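In case it helps anyone else, the change was essentially this in /etc/ntp.conf on each node (the server name here is hypothetical):

```
# /etc/ntp.conf -- point at the local on-premises time server
# instead of the unreachable public pool entries
server ntp.internal.example.com iburst
```

After restarting ntpd, `ntpq -p` should show the new peer and the offset settling down.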