Support Questions

diablo2 · ‎02-07-2017

Apache Ambari Version2.4.0.1

The alterts indicated CRITIAL because of

Metrics Collector has been auto-started 5 times since <timestamp>

This happened very very frequently.

The mainly log ambari-metrics-collector.loglike below

2017-02-07 11:48:42,465 WARN org.apache.zookeeper.ClientCnxn: Session 0x15a1698abc40001 for server humepcomp117.huawei.com/10.106.134.117:61181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
 at sun.nio.ch.IOUtil.read(IOUtil.java:192)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
.
.
2017-02-07 11:48:45,336 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/table/SYSTEM.CATALOG
2017-02-07 11:48:45,336 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure
.
.
2017-02-07 11:49:02,678 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts
2017-02-07 11:49:02,678 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x433defed-0x15a1698abc40001, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Unable to get data of znode /ams-hbase-unsecure/table/METRIC_RECORD
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/table/METRIC_RECORD
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:622)
 at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.getTableState(ZKTableStateClientSideReader.java:185)
 at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.isDisabledTable(ZKTableStateClientSideReader.java:59)
 at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:127)
 at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:960)
 at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1129)
 at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:298)
 at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:150)
 at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:376)
 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
 at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

and hbase-ams-master-humepcomp117.log

2017-02-07 11:48:43,542 INFO  [main] zookeeper.ZooKeeper: Initiating client connection, connectString=humepcomp117.huawei.com:61181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@4450d156
2017-02-07 11:48:43,583 INFO  [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:43,592 WARN  [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2017-02-07 11:48:43,712 WARN  [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
2017-02-07 11:48:44,702 INFO  [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:44,704 WARN  [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
 at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2017-02-07 11:48:44,805 WARN  [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
2017-02-07 11:48:44,805 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 1 attempts
2017-02-07 11:48:44,805 WARN  [main] zookeeper.ZKUtil: clean znode for master0x0, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Unable to get data of znode /ams-hbase-unsecure/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
 at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
 at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
 at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
 at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
 at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:44,807 ERROR [main] zookeeper.ZooKeeperWatcher: clean znode for master0x0, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
 at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
 at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
 at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
 at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
 at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:44,809 WARN  [main] zookeeper.ZooKeeperNodeTracker: Can't get or delete the master znode
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
 at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
 at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
 at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
 at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
 at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:45,805 INFO  [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:45,911 INFO  [main] zookeeper.ZooKeeper: Session: 0x0 closed
2017-02-07 11:48:45,911 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down

Due to networklimitations of the company, I can't paste or upload full log file, I will try to paste in Follow-up posts.

Please kindly help to solve this problem.Thanks.

I had tryied the best answer in

https://community.hortonworks.com/questions/48107/ambari-metrics-collector.html

, but no help.

Thanks for your great support.

jsensharma · ‎02-10-2017

@Huahua Wei

Ideally the "MaxMetaspaceSize" has no upper limit. Please see:

http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/considerations.html

The amount of native memory that can be used for class metadata is by default unlimited. Use the option MaxMetaspaceSize to put an upper limit on the amount of native memory used for class metadata.

.

Regarding the tuning parameter you might want to tune it as following:

Something you should try with this:
----------------------------------------
ams-hbase-env::hbase_master_heapsize 1152 MB ===>> 8192 MB 
ams-hbase-env::hbase_master_maxperm_size 128 MB ===>> 128 MB  (or 256 MB)
ams-hbase-env::hbase_regionserver_heapsize 768 MB ===>> 8192 MB 
ams-hbase-env::regionserver_xmn_size 128 MB ===>> 1280 MB  to 1536 MB

.

In JDK 1.8 PermGen space is replaced with MetaSpace. And it is always better to set the "MaxMetaspaceSize" so that if there is any classloader leak then it will not grow beyond the MaxMetaspaceSize boundary else it may cause huge system memory utilization (in case of leak).

And also disable (exclude) HBase per region metrics to avoid data flooding.

View solution in original post

diablo2 · ‎02-10-2017

Hi all, my last change on following seems solved issue, the alter not happened for more than 24 hours.

ams-hbase-env::hbase_master_heapsize 1152 MB ===>> 8192 MB ams-hbase-env::hbase_master_maxperm_size 128 MB ===>> 8192 MB ams-hbase-env::hbase_regionserver_heapsize 768 MB ===>> 8192 MB ams-hbase-env::regionserver_xmn_size 128 MB ===>> 256 MB

ams-site::phoenix.query.maxGlobalMemoryPercentage 25 ===>> 30 ams-site::timeline.metrics.cluster.aggregate.splitpoints mapred.ShuffleMetrics.ShuffleOutputsFailed ===>> dfs.namenode.FilesInGetListingOps,kafka.network.RequestMetrics.RequestsPerSec.request.LeaderAndIsr.count,master.Master.numActiveHandler,regionserver.Server.checkMutatePassedCount ams-site::timeline.metrics.host.aggregate.splitpoints mapred.ShuffleMetrics.ShuffleOutputsFailed ===>> EventDrainSuccessCount,bytes_out,default.StartupProgress.LoadingEditsPercentComplete,dfs.datanode.BlockChecksumOpAvgTime,dfs.datanode.ReplaceBlockOpAvgTime,dfs.namenode.SyncsNumOps,ipc.IPC.QueueCallTime_num_ops,jvm.JvmMetrics.MemHeapMaxM,kafka.controller.ControllerStats.LeaderElectionRateAndTimeMs.95percentile,kafka.network.RequestMetrics.RequestsPerSec.request.ConsumerMetadata.5MinuteRate,kafka.network.RequestMetrics.RequestsPerSec.request.JoinGroup.5MinuteRate,kafka.network.RequestMetrics.RequestsPerSec.request.Produce.5MinuteRate,kafka.network.SocketServer.IdlePercent.networkProcessor.2.meanRate,load_one,master.Balancer.BalancerCluster_median,master.FileSystem.MetaHlogSplitTime_75th_percentile,master.Server.averageLoad,metricssystem.MetricsSystem.NumActiveSources,read_bps,regionserver.Replication.sink.ageOfLastAppliedOp,regionserver.Server.Increment_75th_percentile,regionserver.Server.SplitTime_max,regionserver.Server.splitSuccessCount,rpc.rpc.SentBytes,rpcdetailed.rpcdetailed.GetQueueUserAclsAvgTime,write_bytes

sidwagle · ‎02-10-2017

XMN should be 15 % of Xmx, suggest setting this value to at 1 GB. maxperm does not need to be more than 128M

diablo2 · ‎02-10-2017

Which two? I'm not fimiliar with JVM.

sidwagle · ‎02-10-2017

In embedded mode we combine region server and hbase heapsize since there is only 1 daemon running.

Xmn settings:
regionserver_xmn_size
hbase_master_xmn_size

JDK8 perm setting is unnecessary since it is offheap, so effectively the setting itself doesn't matter.

jsensharma · ‎02-10-2017

@Huahua Wei

Good to know that you are not getting alert frequently. However i would like to add about one eye catching setting that you have made.

ams-hbase-env::hbase_master_maxperm_size 128 MB ===>> 8192 MB  (too High value)

PermGen: It space stands for memory allocation for Permanent generation All Java immutable objects come under this category, like String which is created with literals or with String.intern() methods and for loading the classes into memory

Basically class definitions are loaded in this area of JVM. 8192 is much much high value. AMS Hbase master does not consume that much memory. So you may reduce that value to at least 8 times lower 🙂

If you want to check how much permgen was utilized at any point of time then you can do the following:

# su - ams
# ps -ef | grep ^ams | grep HMaster
# /JAVA_HOME/bin/jmap  -heap   $PID_HMASTER
.
.

OUTPUT (Example)
PS Perm Generation
   capacity = 57671680 (55.0MB)
   used     = 41699008 (39.76727294921875MB)
   free     = 15972672 (15.23272705078125MB)
   72.30413263494319% used

.

NOTE: From JDK 1.8 onwards that "PermGen" space is replaced with "MetaSpace".

diablo2 · ‎02-10-2017

@Jay SenSharma no "MetaSpace" can't be found in my env.

# ps -ef |grep ^ams|grep HMaster
ams       74758  74744 99 Feb09 ?        1-14:46:59 /usr/local/java/jdk1.8.0_101/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/ambari-metrics-collector/hs_err_pid%p.log -Djava.io.tmpdir=/usr/var/lib/ambari-metrics-collector/hbase-tmp -Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native/ -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/gc.log-201702090946 -Xms16384m -Xmx16384m -Xmn320m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Dhbase.log.dir=/var/log/ambari-metrics-collector -Dhbase.log.file=hbase-ams-master-humepcomp117.log -Dhbase.home.dir=/usr/lib/ams-hbase/ -Dhbase.id.str=ams -Dhbase.root.logger=INFO,RFA -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.master.HMaster start
# jmap -heap 74758              
Attaching to process ID 74758, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.101-b13
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 17179869184 (16384.0MB)
   NewSize                  = 335544320 (320.0MB)
   MaxNewSize               = 335544320 (320.0MB)
   OldSize                  = 16844324864 (16064.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 301989888 (288.0MB)
   used     = 237059568 (226.07762145996094MB)
   free     = 64930320 (61.92237854003906MB)
   78.49917411804199% used
Eden Space:
   capacity = 268435456 (256.0MB)
   used     = 203505136 (194.07762145996094MB)
   free     = 64930320 (61.92237854003906MB)
   75.81157088279724% used
From Space:
   capacity = 33554432 (32.0MB)
   used     = 33554432 (32.0MB)
   free     = 0 (0.0MB)
   100.0% used
To Space:
   capacity = 33554432 (32.0MB)
   used     = 0 (0.0MB)
   free     = 33554432 (32.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 16844324864 (16064.0MB)
   used     = 10044724000 (9579.395294189453MB)
   free     = 6799600864 (6484.604705810547MB)
   59.63268982936662% used
16850 interned Strings occupying 1609888 bytes.

jsensharma · ‎02-10-2017

@Huahua Wei

Yes it is right there in your output :

MetaspaceSize            = 21807104 (20.796875MB)     
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize         = 17592186044415 MB

.

MetaSpace is not displayed same as the "PermGen" used to be present in the Java7 (or previous version of java) jmap output. But you can use some profilers to monitor it.

Or enable garbage collection logging to get the Garbage collection information of your JVM that will also show the metaspace current usage.

Example GC log:

   [GC (Allocation Failure) 
      1.055: [DefNew: 861K->64K(960K), 0.0008510 secs]
      1.056: [Tenured: 1226K->1279K(1280K), 0.0009817 secs]
      1303K->1290K(2240K), 
      [Metaspace: 44K->44K(4480K)],     ------> NOTICE
      0.0019995 secs]

.

diablo2 · ‎02-10-2017

@Jay SenSharma Why this is so big?

MaxMetaspaceSize=17592186044415 MB

jsensharma · ‎02-10-2017

@Huahua Wei

Ideally the "MaxMetaspaceSize" has no upper limit. Please see:

http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/considerations.html

The amount of native memory that can be used for class metadata is by default unlimited. Use the option MaxMetaspaceSize to put an upper limit on the amount of native memory used for class metadata.

.

Regarding the tuning parameter you might want to tune it as following:

Something you should try with this:
----------------------------------------
ams-hbase-env::hbase_master_heapsize 1152 MB ===>> 8192 MB 
ams-hbase-env::hbase_master_maxperm_size 128 MB ===>> 128 MB  (or 256 MB)
ams-hbase-env::hbase_regionserver_heapsize 768 MB ===>> 8192 MB 
ams-hbase-env::regionserver_xmn_size 128 MB ===>> 1280 MB  to 1536 MB

.

In JDK 1.8 PermGen space is replaced with MetaSpace. And it is always better to set the "MaxMetaspaceSize" so that if there is any classloader leak then it will not grow beyond the MaxMetaspaceSize boundary else it may cause huge system memory utilization (in case of leak).

And also disable (exclude) HBase per region metrics to avoid data flooding.

diablo2 · ‎02-15-2017

Finally, I setted as @Jay SenSharma suggested aboved. Working well. Thanks