YARN - Zookeeper failing a few moments after restart

ray_teruya — Sat, 17 Aug 2019 23:26:30 GMT

Good morning guys, thanks in advance for your help!

I have a project that fails. I'm trying to restart all the services manually but haven't been able to.
I have a few questions and I'd really appreciate it if you could give me some guidance, because at this moment I'm kinda stuck.

1. How do I check what services need to be "up and running" before restarting the next one? Is there any place where I can see the dependency?
2. Do I need 2 ZooKeeper servers up and running? The first one is running in localhost but the 2nd one runs in a different machine. If I actually need them both, how can I check what was wrong in the second one?

Re: YARN - Zookeeper failing a few moments after restart

Shelton — Sat, 17 Aug 2019 23:26:22 GMT

@Ray Teruya

Start-all-services-from-Ambari

Start all services. Use Ambari UI > Services > Start All to start all services at once. In Ambari UI > Services you can start, stop, and restart all listed services simultaneously. In Services, click ... and then click Start All.

The first place to check for start failures or successes is the ZooKeeper log: /var/log/zookeeper/zookeeper.log or zookeeper-zookeeper-server-[hostname].out
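
As a minimal sketch of that first check, the helper below pulls the failure lines out of a ZooKeeper log. The name zk_log_errors is made up for illustration, and the pattern only assumes the "timestamp - LEVEL [thread] - message" layout seen in the logs posted in this thread, plus Java OutOfMemoryError lines.

```shell
# Hypothetical helper (not part of ZooKeeper or Ambari): print the lines of a
# ZooKeeper log that look like failures, prefixed with their line numbers.
# Assumes the "<timestamp> - LEVEL [thread] - message" log layout.
zk_log_errors() {
  grep -nE ' - (ERROR|FATAL) +\[|OutOfMemoryError' "$1"
}

# Usage (replace the path with the log file on your host):
#   zk_log_errors /path/to/zookeeper.log
```

An empty result only means nothing matched this pattern, so it is still worth reading the tail of the log by hand.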

According to HWX documentation make sure to manually start the Hadoop services in this prescribed order

1. How do I check what services need to be "up and running" before restarting the next one? Is there any place where I can see the dependency?

The above gives you the list and order of dependency

2. Do I need 2 ZooKeeper servers up and running? The first one is running in localhost but the 2nd one runs in a different machine. If I actually need them both, how can I check what was wrong in the second one?

If you are not running an HA configuration, a single ZooKeeper suffices. But if you want to emulate a production environment with many data nodes and enable HA [NameNode or RM], you MUST have at least 3 ZooKeepers to avoid the split-brain phenomenon.
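
The arithmetic behind the "at least 3" rule can be sketched in a few lines of shell. The function names are illustrative only. With majority quorums, an ensemble of N servers needs floor(N/2) + 1 of them up, and it tolerates however many are left over.

```shell
# Illustrative helpers, not part of any ZooKeeper tooling.
# Majority quorum size for an ensemble of N servers: floor(N/2) + 1.
zk_quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of servers that can fail while a majority is still available.
zk_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "servers=$n quorum=$(zk_quorum_size "$n") tolerated_failures=$(zk_tolerated_failures "$n")"
done
```

With 2 servers the quorum is also 2, so losing either one stops the ensemble. That is why a 2-server setup is no more fault tolerant than a single server, and 3 is the smallest ensemble that survives one failure.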

Hope that helps

Re: YARN - Zookeeper failing a few moments after restart

ask_bill_brooks — Wed, 31 Jul 2019 10:11:11 GMT

The above question was originally posted in the Community Help track. On Wed Jul 31 03:08 UTC 2019, a member of the HCC moderation staff moved it to the Cloud & Operations track. The Community Help Track is intended for questions about using the HCC site itself, not for technical questions about using Ambari.

Re: YARN - Zookeeper failing a few moments after restart

Shelton — Wed, 31 Jul 2019 20:36:55 GMT

@Ray Teruya

How many hosts do you have in your cluster? Can you share your zookeeper logs and your /etc/hosts?

HTH

Re: YARN - Zookeeper failing a few moments after restart

ray_teruya — Thu, 01 Aug 2019 00:32:10 GMT

Thanks again @Geoffrey Shelton Okot

I have 4 machines in the cluster (1 master 3 slaves)
Zookeeper server is installed in the master (works fine) and in one slave (fails)

This is the log I get from the slave
zookeeperLOG.txt

Re: YARN - Zookeeper failing a few moments after restart

ray_teruya — Sat, 17 Aug 2019 23:26:14 GMT

Thanks @Geoffrey Shelton Okot for your help!

I did restart all services manually but it seems that ZK still fails. From the screenshot I posted, one of my ZK servers is always down. Since ZK needs to be up and running before anything else, I'd like to fix this issue first. I checked the error message from Ambari and it says:

Connection failed: [Errno 111] Connection refused to ip_zookeeper_server2:2181

What else can I do to fix this?

update:
This is what the log file shows on that machine

2019-07-31 07:57:58,187 - INFO [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.
2019-07-31 07:57:58,191 - INFO [main:QuorumPeerConfig@338] - Defaulting to majority quorums
2019-07-31 07:57:58,196 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
2019-07-31 07:57:58,197 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2019-07-31 07:57:58,198 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2019-07-31 07:57:58,210 - INFO [main:QuorumPeerMain@127] - Starting quorum peer
2019-07-31 07:57:58,219 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2019-07-31 07:57:58,223 - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@992] - tickTime set to 2000
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1012] - minSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1023] - maxSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1038] - initLimit set to 10
2019-07-31 07:57:58,245 - INFO [main:FileSnap@83] - Reading snapshot /hadoop/zookeeper/version-2/snapshot.8600bc40ab
2019-07-31 07:58:41,800 - ERROR [main:NIOServerCnxnFactory$1@44] - Thread Thread[main,5,main] died
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:97)
    at org.apache.zookeeper.server.DataNode.deserialize(DataNode.java:158)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.DataTree.deserialize(DataTree.java:1194)
    at org.apache.zookeeper.server.util.SerializeUtils.deserializeSnapshot(SerializeUtils.java:127)
    at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:127)
    at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:87)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:483)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:473)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

Re: YARN - Zookeeper failing a few moments after restart

ray_teruya — Thu, 01 Aug 2019 02:21:21 GMT

After checking today's log file I found this.
Will google it to see what it means

2019-07-31 07:57:58,187 - INFO [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.
2019-07-31 07:57:58,191 - INFO [main:QuorumPeerConfig@338] - Defaulting to majority quorums
2019-07-31 07:57:58,196 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
2019-07-31 07:57:58,197 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2019-07-31 07:57:58,198 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2019-07-31 07:57:58,210 - INFO [main:QuorumPeerMain@127] - Starting quorum peer
2019-07-31 07:57:58,219 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2019-07-31 07:57:58,223 - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@992] - tickTime set to 2000
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1012] - minSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1023] - maxSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1038] - initLimit set to 10
2019-07-31 07:57:58,245 - INFO [main:FileSnap@83] - Reading snapshot /hadoop/zookeeper/version-2/snapshot.8600bc40ab
2019-07-31 07:58:41,800 - ERROR [main:NIOServerCnxnFactory$1@44] - Thread Thread[main,5,main] died
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:97)
    at org.apache.zookeeper.server.DataNode.deserialize(DataNode.java:158)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.DataTree.deserialize(DataTree.java:1194)
    at org.apache.zookeeper.server.util.SerializeUtils.deserializeSnapshot(SerializeUtils.java:127)
    at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:127)
    at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:87)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:483)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:473)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

Re: YARN - Zookeeper failing a few moments after restart

Shelton — Sat, 17 Aug 2019 23:26:06 GMT

@Ray Teruya

OutOfMemoryError is a subclass of java.lang.VirtualMachineError; it is thrown by the JVM when it runs out of memory resources. More specifically, the "GC overhead limit exceeded" variant occurs when the JVM spends too much time performing garbage collection and is able to reclaim only very little heap space.

According to the Java docs, by default the JVM is configured to throw this error if the Java process spends more than 98% of its time doing GC and recovers less than 2% of the heap in each run. In other words, the application has exhausted nearly all the available memory, and the garbage collector has spent too much time trying to clean it up and failed repeatedly.

In this situation, users experience extreme slowness of the application. Certain operations, which usually complete in milliseconds, take more time to complete. This is because the CPU is using its entire capacity for Garbage Collection and hence cannot perform any other tasks.

Solution:

On HDP 3.x and 2.6.x, depending on the memory available to the cluster, check and increase the ZooKeeper server maximum memory value.

You could raise it to 2048 MB.
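
Before picking a number, it may help to compare the heap with the snapshot ZooKeeper has to load at startup, since the stack trace posted earlier dies inside FileSnap.deserialize while reading a snapshot. Here is a rough sketch, assuming the data directory is /hadoop/zookeeper/version-2 as in the posted log. The helper name is hypothetical.

```shell
# Hypothetical helper: print the size, in whole MB, of the largest
# snapshot.* file in a ZooKeeper data directory (0 if there are none).
largest_snapshot_mb() {
  dir="$1"
  largest=0
  for f in "$dir"/snapshot.*; do
    # Skip the unexpanded glob when no snapshot files exist.
    if [ -f "$f" ]; then
      size=$(( $(wc -c < "$f") ))
      if [ "$size" -gt "$largest" ]; then
        largest=$size
      fi
    fi
  done
  echo $(( largest / 1024 / 1024 ))
}

# Usage (directory taken from the log in this thread):
#   largest_snapshot_mb /hadoop/zookeeper/version-2
```

The deserialized data tree typically takes more room in memory than the file does on disk, so treat the result as a lower bound for the heap rather than a target.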

HTH

Re: YARN - Zookeeper failing a few moments after restart

Shelton — Mon, 12 Aug 2019 01:16:53 GMT

@Ray Teruya

The warning below is what I described in my answer to question 2 in my earlier post. To avoid split-brain, you MUST install 3 ZooKeepers.

2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.

Solution

Delete/remove the failed installation.

Add 2 new ZooKeeper servers to your cluster through the Ambari UI using ADD SERVICE, and start them if they aren't already started. This should form a quorum where only one is the leader and the rest are followers.

To identify the ZooKeeper leader and followers, there are a few possible options. I'm mentioning 2 to keep this simple.

1. Check the zookeeper log file on each node, and grep as below:

# grep LEAD /var/log/zookeeper/zookeeper-zookeeper-server-xyz.out

Desired output

2019-08-10 22:33:47,113 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@829] - LEADING

2019-08-10 22:33:47,114 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Leader@358] - LEADING - LEADER ELECTION TOOK - 9066

2. Use the "nc" command to query the ZooKeeper server over TCP on port 2181 and determine if it is a leader or a follower.
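
The "nc" option mentioned above could look something like the sketch below. It relies on ZooKeeper's four-letter commands: sending srvr (or stat) to the client port returns a short status report that includes a "Mode:" line. The helper name zk_mode and the host name in the usage comment are placeholders, not anything from this cluster.

```shell
# Hypothetical helper: read the reply to ZooKeeper's "srvr" (or "stat")
# four-letter command on stdin and print just the role
# (leader, follower or standalone).
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2 }'
}

# Usage, run against each ZooKeeper host (host name is a placeholder):
#   echo srvr | nc zk-host.example.com 2181 | zk_mode
```

In a healthy ensemble, exactly one host should report leader and the others follower. If a host prints nothing, the server is probably not answering on 2181 at all. On newer ZooKeeper releases (3.5 and later, as far as I know) four-letter commands have to be allowed through the 4lw.commands.whitelist setting.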

After doing the above procedure you should be good to go.

HTH

Re: YARN - Zookeeper failing a few moments after restart

ray_teruya — Tue, 13 Aug 2019 22:55:40 GMT

Wow thanks @Geoffrey Shelton Okot and sorry for the late response. Changing the maximum memory value did the job. Now we're checking that it stays stable. So far so good!

Re: YARN - Zookeeper failing a few moments after restart

Shelton — Sun, 18 Aug 2019 08:27:16 GMT

@ray_teruya

If you found this answer addressed your question, please take a moment to log in and click the "kudos" link on the answer.

That would be a great help to Community users to find the solution quickly for these kinds of errors.