Support Questions


YARN - Zookeeper failing a few moments after restart

Explorer

Good morning guys, thanks in advance for your help!

I have a project that fails. I'm trying to restart all the services manually but haven't been able to.
I have a few questions, and I'd really appreciate some guidance because at this moment I'm kind of stuck.


1. How do I check which services need to be "up and running" before restarting the next one? Is there any place where I can see the dependencies?
2. Do I need both ZooKeeper servers up and running? The first one runs on localhost but the second one runs on a different machine. If I actually need them both, how can I check what went wrong on the second one?

110104-ambarierrors.png

1 ACCEPTED SOLUTION

Master Mentor

@Ray Teruya

OutOfMemoryError is a subclass of java.lang.VirtualMachineError; the JVM throws it when it runs out of resources. More specifically, this variant occurs when the JVM has spent too much time performing garbage collection while reclaiming very little heap space.

110185-1564828294767.png

According to the Java docs, by default the JVM throws this error if the Java process spends more than 98% of its time doing GC while recovering less than 2% of the heap in each run. In other words, the application has exhausted nearly all the available memory, and the garbage collector has repeatedly spent too much time trying to reclaim it without success.

In this situation, users experience extreme slowness of the application. Operations that usually complete in milliseconds take far longer, because the CPU is spending its entire capacity on garbage collection and cannot perform any other work.
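For reference, those 98% / 2% thresholds correspond to HotSpot flags. A sketch with the defaults spelled out (app.jar is a placeholder for your application); note that disabling the check only hides the symptom:

# Defaults behind "GC overhead limit exceeded" (HotSpot parallel collector)
java -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2 -jar app.jar

# The check can be switched off, but the underlying memory exhaustion remains:
java -XX:-UseGCOverheadLimit -jar app.jar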

Solution:


On HDP 3.x and 2.6.x, depending on the memory available to the cluster, check and increase the ZooKeeper server memory setting shown below:

110193-1564829131326.png

You could raise it to 2048 MB, for example.
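Under the hood, that Ambari setting ends up in zookeeper-env, which zkServer.sh reads. A minimal sketch of the resulting environment (edit it via Ambari > ZooKeeper > Configs rather than by hand, or Ambari will overwrite it; exact property names can differ between Ambari versions):

# zookeeper-env.sh (excerpt) - server heap raised to 2048 MB
export SERVER_JVMFLAGS="-Xmx2048m"

# After restarting the ZooKeeper server, confirm the new heap is in effect:
ps -ef | grep '[z]ookeeper' | grep -o 'Xmx[^ ]*'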

HTH


10 REPLIES

Master Mentor

@Ray Teruya

Start-all-services-from-Ambari

To start all services at once, go to Ambari UI > Services, click ... and then click Start All. From the same menu you can also stop or restart all listed services.

110131-taruya.png
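If the Ambari UI is unreachable, the same "Start All" can also be triggered through the Ambari REST API. A minimal sketch, assuming admin/admin credentials, port 8080, and a cluster named "mycluster" (all placeholders):

curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start All Services"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  'http://ambari-server.example.com:8080/api/v1/clusters/mycluster/services'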


The first place to check for start failures or success is /var/log/zookeeper/zookeeper.log or zookeeper-zookeeper-server-[hostname].out.
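For example, on the host where the start failed:

tail -n 200 /var/log/zookeeper/zookeeper.log
grep -iE 'ERROR|FATAL|Exception' /var/log/zookeeper/zookeeper-zookeeper-server-*.out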

According to the HWX documentation, make sure to manually start the Hadoop services in the prescribed order (see the link above).

1. How do I check which services need to be "up and running" before restarting the next one? Is there any place where I can see the dependencies?

The link above gives you the list and the order of the dependencies.


2. Do I need both ZooKeeper servers up and running? The first one runs on localhost but the second one runs on a different machine. If I actually need them both, how can I check what went wrong on the second one?

If you are not running an HA configuration, a single ZooKeeper suffices. But if you want to emulate a production environment with many DataNodes and enable HA for the NameNode or ResourceManager, you MUST have at least 3 ZooKeeper servers to avoid the split-brain phenomenon.
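For reference, a 3-server ensemble is declared in zoo.cfg on every node, and each node needs a matching myid file in its dataDir. A sketch with placeholder hostnames and paths:

# /usr/hdp/current/zookeeper-server/conf/zoo.cfg (excerpt)
dataDir=/hadoop/zookeeper
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# On each host, myid must match that host's server.N entry, e.g. on zk2:
echo 2 > /hadoop/zookeeper/myid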


Hope that helps


The above question was originally posted in the Community Help track. On Wed Jul 31 03:08 UTC 2019, a member of the HCC moderation staff moved it to the Cloud & Operations track. The Community Help Track is intended for questions about using the HCC site itself, not for technical questions about using Ambari.

Bill Brooks, Community Moderator
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Master Mentor

@Ray Teruya

How many hosts do you have in your cluster? Can you share your zookeeper logs and your /etc/hosts?

HTH

Explorer

Thanks again, @Geoffrey Shelton Okot!

I have 4 machines in the cluster (1 master, 3 slaves).
The ZooKeeper server is installed on the master (works fine) and on one slave (fails).

This is the log I get from the slave
zookeeperLOG.txt

Explorer

Thanks @Geoffrey Shelton Okot for your help!

I did restart all services manually, but it seems that ZK still fails. As the screenshot I posted shows, one of my ZK servers is always down. Since ZK needs to be up and running before anything else, I'd like to fix this issue first. I checked the error message from Ambari and it says:


110143-zk-server-failing.png


The error message says
Connection failed: [Errno 111] Connection refused to ip_zookeeper_server2:2181


What else can I do to fix this?
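A quick check that can narrow this down, since "Connection refused" usually means nothing is listening on that port (i.e., the process died) rather than a network problem. A sketch, reusing the hostname from the error above:

# Is anything listening on 2181 on the failing host?
netstat -ltnp | grep 2181        # or: ss -ltnp | grep 2181

# Is the server healthy? "ruok" should answer "imok"
echo ruok | nc ip_zookeeper_server2 2181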



Explorer

After checking today's log file I found this.
I'll google it to see what it means.


2019-07-31 07:57:58,187 - INFO [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.
2019-07-31 07:57:58,191 - INFO [main:QuorumPeerConfig@338] - Defaulting to majority quorums
2019-07-31 07:57:58,196 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
2019-07-31 07:57:58,197 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2019-07-31 07:57:58,198 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2019-07-31 07:57:58,210 - INFO [main:QuorumPeerMain@127] - Starting quorum peer
2019-07-31 07:57:58,219 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2019-07-31 07:57:58,223 - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@992] - tickTime set to 2000
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1012] - minSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1023] - maxSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1038] - initLimit set to 10
2019-07-31 07:57:58,245 - INFO [main:FileSnap@83] - Reading snapshot /hadoop/zookeeper/version-2/snapshot.8600bc40ab
2019-07-31 07:58:41,800 - ERROR [main:NIOServerCnxnFactory$1@44] - Thread Thread[main,5,main] died
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:97)
at org.apache.zookeeper.server.DataNode.deserialize(DataNode.java:158)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.DataTree.deserialize(DataTree.java:1194)
at org.apache.zookeeper.server.util.SerializeUtils.deserializeSnapshot(SerializeUtils.java:127)
at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:127)
at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:87)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:483)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:473)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

Master Mentor

@Ray Teruya

The error in BOLD below is what I stated in question/answer 2 of my former post. To avoid split-brain, you MUST install 3 ZooKeeper servers.

2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.

Solution

Delete/remove the failed installation.

Add 2 new ZooKeeper servers to your cluster via the Ambari UI using ADD SERVICE, and start the new ZooKeepers if they don't start automatically. This should form a quorum where only one server is the leader and the rest are followers. To identify the ZooKeeper leader/followers, there are a few possible options; mentioning 2 to keep this answer simple:

1. Check the zookeeper log file on each node, and grep as below:

# grep LEAD /var/log/zookeeper/zookeeper-zookeeper-server-xyz.out

Desired output:

2019-08-10 22:33:47,113 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@829] - LEADING

2019-08-10 22:33:47,114 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Leader@358] - LEADING - LEADER ELECTION TOOK - 9066

2. Use the "nc" command against TCP port 2181 to determine whether a ZooKeeper server is a leader or a follower, as shown in the sketch below.
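A sketch of option 2 using ZooKeeper's four-letter-word "stat" command (hostnames are placeholders; on newer ZooKeeper releases these commands may need to be whitelisted):

for h in zk1.example.com zk2.example.com zk3.example.com; do
  echo -n "$h: "
  echo stat | nc "$h" 2181 | grep Mode   # prints "Mode: leader" or "Mode: follower"
done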

After doing the above procedure you should be good to go.

HTH

Explorer

Wow, thanks @Geoffrey Shelton Okot, and sorry for the late response. Changing the maximum memory value did the job. Now we're checking that it stays stable. So far so good!