Help diagnosing zookeeper timeouts

Super Collaborator

Hello,

We've recently been seeing some weird behavior from our cluster.

Things will work well for a day or two, and then Hive server and several region servers will go offline.

When I dig into the logs, they all reference zookeeper:

2019-05-24 20:12:15,108 ERROR nodes.PersistentEphemeralNode (PersistentEphemeralNode.java:deleteNode(323)) - Deleting node: /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hiveserver2/serverUri=<servername>:10010;version=1.2.1000.2.6.1.0-129;sequence=0000000187
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
       at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:239)
       at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:234)
       at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
       at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
       at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:215)
       at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:42)
       at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.deleteNode(PersistentEphemeralNode.java:315)
       at org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode.close(PersistentEphemeralNode.java:274)
       at org.apache.hive.service.server.HiveServer2$DeRegisterWatcher.process(HiveServer2.java:334)
       at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:61)
       at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2019-05-24 20:12:15,110 ERROR server.HiveServer2 (HiveServer2.java:process(338)) - Failed to close the persistent ephemeral znode

However, when I look in the zookeeper logs, I don't see anything.

If I re-start the failed services, they will run for several hours, and then the process repeats.

We haven't changed any settings on the cluster, but two things have changed recently:

1 - A couple weeks ago, some IT guys made a mistake and accidentally changed the /etc/hosts files

We fixed this, and re-started everything on the cluster.

2 - Those changes in (1) were part of some major network changes and we seem to have a lot more latency.

With all of that said, I really need some help figuring this out.

Could it be stale HBase wal files somewhere? Could that cause Hive server to fail?

Is there a zookeeper timeout setting I can change to help?

Any tips would be much appreciated.
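
On the timeout question: the ZooKeeper server does not simply accept whatever session timeout a client requests. It clamps the request to the range between minSessionTimeout and maxSessionTimeout, which default to 2x and 20x tickTime in zoo.cfg. Raising only the client-side timeout may therefore have no effect beyond that upper bound. The sketch below just illustrates the arithmetic; the function name is hypothetical, and the 2000 ms tickTime is an assumption taken from the sample zoo.cfg, so it may not match this cluster.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Return the session timeout the server would grant, in milliseconds.

    tick_time_ms=2000 is an assumed value (sample zoo.cfg), not this cluster's.
    min_ms / max_ms stand in for minSessionTimeout / maxSessionTimeout;
    when unset they fall back to 2x and 20x tickTime.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    # Clamp the client's request into [lower, upper].
    return max(lower, min(requested_ms, upper))


# A client asking for 90 s against the assumed tickTime is capped at 40 s.
print(negotiated_session_timeout(90000))
# Raising maxSessionTimeout on the server lets the same request through.
print(negotiated_session_timeout(90000, max_ms=120000))
```

If the granted timeout turns out to be the limiting factor, the server-side maxSessionTimeout would need to change as well as the client setting.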

4 REPLIES

Re: Help diagnosing zookeeper timeouts

Rising Star

The above was originally posted in the Community Help Track. On Sun May 26 01:17 UTC 2019, a member of the HCC moderation staff moved it to the Hadoop Core track. The Community Help Track is intended for questions about using the HCC site itself.

Re: Help diagnosing zookeeper timeouts

Mentor

@Zack Riesland

Once a hostname has been changed, updating the Ambari-related settings alone is not enough. See the HWX document on changing host names and follow it strictly.

You will need to check, and update if necessary, the properties below in ambari.properties:

server.jdbc.hostname=
server.jdbc.rca.url=
server.jdbc.url=
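
A quick way to check those three entries is to scan the file for the old host name. This is only a sketch: the function name is hypothetical, and both the file path and the host name in the usage comment are assumptions to replace with your own values.

```python
def stale_jdbc_entries(properties_text, old_host):
    """Return {key: value} for server.jdbc.* entries that still mention old_host."""
    stale = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if key.startswith("server.jdbc.") and old_host in value:
            stale[key] = value
    return stale


# Usage (path and host name are assumptions, adjust for your install):
# with open("/etc/ambari-server/conf/ambari.properties") as fh:
#     print(stale_jdbc_entries(fh.read(), "old-host.example.com"))
```

An empty result only means these entries no longer mention the old name; it does not confirm that the new values are correct.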

For Hive, Oozie and Ranger, you will also need to update the equivalent settings if the host running the databases for these components was changed.

As the root user, connect to your Ambari database backend and execute:

grant all privileges on hive.* to 'hive'@'new_host' identified by 'hive_password';
grant all privileges on hive.* to 'hive'@'new_host' with grant option;

Run the above for each of the affected services mentioned above.

You could be having an HBase-to-Hive schema mapping issue! Please report back.


Re: Help diagnosing zookeeper timeouts

Super Collaborator

Thanks @Geoffrey Shelton Okot

Just to clarify, we corrected all the hosts files and re-started all the services.

I have a hunch that there is some HBase data somewhere that is now corrupt because it is associated with the incorrect FQDN.

But I wouldn't expect hive to have any relationship to hbase.

Does zookeeper use hbase for record keeping?

Re: Help diagnosing zookeeper timeouts

Super Collaborator

It took me a while to look in /var/log/messages, but I found a ton of ntpd errors.

It turns out that our nodes were having issues getting out to the servers they were configured to use for sync.

I switched all the configurations to use a local on-premises server and restarted everything.

I'm hoping that will be the full solution to our issue.
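
For anyone chasing the same symptom, pulling the ntpd entries out of the system log can be sketched as below. The helper name is hypothetical, and the simple substring match is an assumption about how ntpd tags its syslog lines on your distribution.

```python
def ntpd_lines(log_lines):
    """Yield the lines from an iterable of syslog lines that mention ntpd."""
    for line in log_lines:
        if "ntpd" in line:
            yield line.rstrip("\n")


# Usage (reading /var/log/messages usually requires root):
# with open("/var/log/messages", errors="replace") as fh:
#     for entry in ntpd_lines(fh):
#         print(entry)
```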