Created on 07-30-2019 08:05 PM - edited 08-17-2019 04:26 PM
Good morning guys, thanks in advance for your help!
I have a project that is failing. I'm trying to restart all the services manually but haven't been able to.
I have a few questions and I'd really appreciate it if you could give me some guidance because at this moment I'm kinda stuck.
1. How do I check what services need to be "up and running" before restarting the next one? Is there any place where I can see the dependency?
2. Do I need 2 ZooKeeper servers up and running? The first one is running in localhost but the 2nd one runs in a different machine. If I actually need them both, how can I check what was wrong in the second one?
Created on 07-30-2019 09:05 PM - edited 08-17-2019 04:26 PM
Start all services from Ambari
To start all services at once, go to Ambari UI > Services, click the ... (actions) menu, and then click Start All. From the same Services page you can also stop or restart all listed services simultaneously.
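If you prefer to script this, the same Start All can be issued through the Ambari REST API. A minimal sketch, assuming an admin/admin account, an Ambari server reachable as ambari-server, and a cluster named MyCluster (substitute your own host, credentials, and cluster name):
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Start All Services"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
  http://ambari-server:8080/api/v1/clusters/MyCluster/services
This sets the desired state of every service to STARTED and returns a request id you can track under Ambari's background operations.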
The first place to check for start failures or success is /var/log/zookeeper/zookeeper.log or zookeeper-zookeeper-server-[hostname].out
According to the HWX documentation, make sure to manually start the Hadoop services in the prescribed order.
1. How do I check what services need to be "up and running" before restarting the next one? Is there any place where I can see the dependency?
The documentation above gives you the list and the order of dependencies.
2. Do I need 2 ZooKeeper servers up and running? The first one is running in localhost but the 2nd one runs in a different machine. If I actually need them both, how can I check what was wrong in the second one?
If you are not running an HA configuration, a single ZooKeeper suffices, but if you want to emulate a production environment with many DataNodes and enable HA [NameNode or ResourceManager], you MUST have at least 3 ZooKeeper servers to avoid the split-brain phenomenon.
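For reference, a minimal sketch of the ensemble section of zoo.cfg for a 3-node quorum; zk1/zk2/zk3 are hypothetical hostnames, and Ambari generates these entries for you when you add ZooKeeper servers through the UI:
# zoo.cfg ensemble entries: server.<myid>=<host>:<peer_port>:<leader_election_port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
# each host also needs a matching myid file in its dataDir (e.g. /hadoop/zookeeper/myid) containing just 1, 2, or 3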
Hope that helps
Created 07-31-2019 01:36 PM
How many hosts do you have in your cluster? Can you share your zookeeper logs and your /etc/hosts?
HTH
Created 07-31-2019 05:32 PM
Thanks again @Geoffrey Shelton Okot
I have 4 machines in the cluster (1 master, 3 slaves).
A ZooKeeper server is installed on the master (works fine) and on one slave (fails).
This is the log I get from the slave:
zookeeperLOG.txt
Created on 07-31-2019 07:21 PM - edited 08-17-2019 04:26 PM
Thanks @Geoffrey Shelton Okot for your help!
I did restart all services manually, but it seems that ZK still fails. As the screenshot I posted shows, one of my ZK servers is always down. Since ZK needs to be up and running before anything else, I'd like to fix this issue first. The error message from Ambari says:
Connection failed: [Errno 111] Connection refused to ip_zookeeper_server2:2181
What else can I do to fix this?
Update:
This is what the log file shows on that machine:
2019-07-31 07:57:58,187 - INFO [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.
2019-07-31 07:57:58,191 - INFO [main:QuorumPeerConfig@338] - Defaulting to majority quorums
2019-07-31 07:57:58,196 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
2019-07-31 07:57:58,197 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2019-07-31 07:57:58,198 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2019-07-31 07:57:58,210 - INFO [main:QuorumPeerMain@127] - Starting quorum peer
2019-07-31 07:57:58,219 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2019-07-31 07:57:58,223 - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@992] - tickTime set to 2000
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1012] - minSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1023] - maxSessionTimeout set to -1
2019-07-31 07:57:58,233 - INFO [main:QuorumPeer@1038] - initLimit set to 10
2019-07-31 07:57:58,245 - INFO [main:FileSnap@83] - Reading snapshot /hadoop/zookeeper/version-2/snapshot.8600bc40ab
2019-07-31 07:58:41,800 - ERROR [main:NIOServerCnxnFactory$1@44] - Thread Thread[main,5,main] died
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:97)
at org.apache.zookeeper.server.DataNode.deserialize(DataNode.java:158)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.DataTree.deserialize(DataTree.java:1194)
at org.apache.zookeeper.server.util.SerializeUtils.deserializeSnapshot(SerializeUtils.java:127)
at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:127)
at org.apache.zookeeper.server.persistence.FileSnap.deserialize(FileSnap.java:87)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:130)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:483)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:473)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:153)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Created 07-31-2019 07:21 PM
After checking today's log file I found the OutOfMemoryError shown in the log above. I will google it to see what it means.
Created on 08-03-2019 11:07 AM - edited 08-17-2019 04:26 PM
OutOfMemoryError is a subclass of java.lang.VirtualMachineError; it is thrown by the JVM when it runs out of a resource it needs. More specifically, the "GC overhead limit exceeded" variant occurs when the JVM spends too much time performing Garbage Collection while reclaiming only very little heap space.
According to the Java docs, by default the JVM throws this error if the Java process spends more than 98% of its time doing GC and less than 2% of the heap is recovered in each run. In other words, the application has exhausted nearly all of the available memory and the Garbage Collector has repeatedly spent too much time trying to reclaim it, and failed.
In this situation users experience extreme slowness: operations that usually complete in milliseconds take much longer, because the CPU is spending almost its entire capacity on Garbage Collection and cannot perform any other work.
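For reference, the 98% / 2% thresholds mentioned above correspond to HotSpot JVM flags. A hedged illustration of the defaults (raising the ZooKeeper heap, as described below, is the proper fix; disabling the check only delays the failure):
# HotSpot defaults behind the GC overhead limit (not ZooKeeper-specific):
#   -XX:GCTimeLimit=98        percentage of time spent in GC that triggers the error
#   -XX:GCHeapFreeLimit=2     minimum percentage of heap that must be freed per GC
#   -XX:-UseGCOverheadLimit   disables the check entirely (not recommended)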
Solution:
On HDP 3.x & 2.6.x, depending on the memory available to the cluster, check and increase the ZooKeeper server maximum heap size (Ambari UI > ZooKeeper > Configs).
You could raise it to 2048 MB, for example.
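If the cluster is not Ambari-managed, or you want to verify what is actually in effect on the failing host, here is a minimal sketch using ZooKeeper's standard SERVER_JVMFLAGS environment variable; the exact file and property names vary by HDP/Ambari version, and on an Ambari-managed cluster you should change the value through the UI so it is not overwritten:
# in zookeeper-env.sh (Ambari: ZooKeeper > Configs > Advanced zookeeper-env), hypothetical value
export SERVER_JVMFLAGS="-Xmx2048m"
# restart the ZooKeeper server on that host, then confirm the new heap is applied
ps -ef | grep zookeeper | grep Xmx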
HTH
Created 08-11-2019 06:16 PM
The warning below is what I stated in my answer to question 2 in my earlier post. To avoid the split-brain problem you MUST install 3 ZooKeeper servers.
2019-07-31 07:57:58,191 - WARN [main:QuorumPeerConfig@291] - No server failure will be tolerated. You need at least 3 servers.
Solution
1. Delete/remove the failed installation.
2. Add 2 new ZooKeeper servers to your cluster using the Ambari UI (ADD SERVICE), and start the new ZooKeepers if they aren't already started. This should form a quorum where only one server is the leader and the rest are followers.
To identify a ZooKeeper leader/follower there are a few possible options; I'll mention two to keep this simple (the second, using "nc" on client port 2181, is shown after the grep output below).
Option 1: Check the zookeeper log file on each node, and grep as below:
# grep LEAD /var/log/zookeeper/zookeeper-zookeeper-server-xyz.out
Desired output
2019-08-10 22:33:47,113 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@829] - LEADING
2019-08-10 22:33:47,114 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Leader@358] - LEADING - LEADER ELECTION TOOK - 9066
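Option 2: Query the ZooKeeper client port with "nc". A hedged example using ZooKeeper's four-letter-word "stat" command (assuming nc is installed on the host and four-letter words are not restricted, which is the default on the ZooKeeper 3.4.x shipped with HDP; zk_hostname is a placeholder):
# echo stat | nc zk_hostname 2181 | grep Mode
Mode: leader
A follower reports "Mode: follower" instead; a node that refuses the connection is not serving at all.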
After doing the above procedure you should be good to go.
HTH
Created 08-13-2019 03:55 PM
Wow thanks @Geoffrey Shelton Okot and sorry for the late response. Changing the maximum memory value did the job. Now we're checking that it stays stable. So far so good!