
metrics collector crash after moving to distributed mode and after metrics cleaning


Hi all,

Our metrics collector crashes after we moved it to distributed mode and cleaned the metrics data.

The collector runs for some time, then fails, and we get the following errors:

 

2020-02-23 13:44:53,163 INFO org.apache.zookeeper.ZooKeeper: Session: 0x170721fadf90010 closed
2020-02-23 13:44:53,163 INFO org.apache.helix.manager.zk.ZkClient: Closed zkclient
2020-02-23 13:44:53,163 INFO org.apache.helix.manager.zk.ZKHelixManager: Cluster manager: master.sys9378.com disconnected
2020-02-23 13:44:53,163 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down for session: 0x170721fadf90010
2020-02-23 13:44:53,166 WARN org.apache.helix.controller.GenericHelixController: ClusterEventProcessor interrupted
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.helix.controller.stages.ClusterEventBlockingQueue.take(ClusterEventBlockingQueue.java:85)
at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:594)
2020-02-23 13:44:53,166 INFO org.apache.helix.controller.GenericHelixController: END ClusterEventProcessor thread
2020-02-23 13:44:53,173 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping phoenix metrics system...
2020-02-23 13:44:53,181 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: phoenix metrics system stopped.
2020-02-23 13:44:53,181 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: phoenix metrics system shutdown complete.
2020-02-23 13:44:53,182 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl: Stopping ApplicationHistory
2020-02-23 13:44:53,183 INFO org.apache.hadoop.ipc.Server: Stopping server on 60200
2020-02-23 13:44:53,191 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2020-02-23 13:44:53,191 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60200
2020-02-23 13:44:53,206 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer at master.sys9378.com/42.3.44..126
************************************************************/

 

It is not clear why we get these errors, because we cleaned all the metrics data as follows.

From the temp folder:

/hadoop/var/lib/ambari-metrics-collector

From HDFS:

hdfs dfs -rm -r -f /user/ams/hbase/*

And from the ZooKeeper CLI:

[zk: 127.0.0.1:2181(CONNECTED) 3] rmr /ams-hbase-unsecure

 

Michael-Bronson

Re: metrics collector crash after moving to distributed mode and after metrics cleaning

Super Mentor

@mike_bronson7 

Since you are running AMS in distributed mode, it is best to check the AMS HBase Master logs for errors first, because if the AMS HMaster process does not run successfully, the AMS collector will definitely go down.

 

So we should check all of these logs for errors:

/var/log/ambari-metrics-collector/hbase-ams-master-*.log
/var/log/ambari-metrics-collector/hbase-ams-region-*.log
/var/log/ambari-metrics-collector/ambari-metrics-collector.log

After freshly restarting the AMS service, what is the first error that you see in "hbase-ams-master-xxx.log" and in "ambari-metrics-collector.log"?
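One quick way to find that first error is to scan each of the logs listed above for its first ERROR or FATAL entry. The following is only a minimal Python sketch and not part of AMS: the helper names `first_error` and `scan_logs` are hypothetical, and it assumes the log level appears as a standalone token, as it does in the log excerpts in this thread.

```python
import glob
import re

# Matches a log level as a standalone token, e.g. "... 13:44:53,166 ERROR org.apache..."
LEVEL_RE = re.compile(r"\b(ERROR|FATAL)\b")


def first_error(lines):
    """Return (line_number, text) of the first ERROR/FATAL line, or None."""
    for number, line in enumerate(lines, start=1):
        if LEVEL_RE.search(line):
            return number, line.rstrip("\n")
    return None


def scan_logs(patterns):
    """Map each log file matching the glob patterns to its first error (or None)."""
    results = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            with open(path, errors="replace") as handle:
                results[path] = first_error(handle)
    return results


if __name__ == "__main__":
    # Log locations taken from the list above.
    log_patterns = [
        "/var/log/ambari-metrics-collector/hbase-ams-master-*.log",
        "/var/log/ambari-metrics-collector/hbase-ams-region-*.log",
        "/var/log/ambari-metrics-collector/ambari-metrics-collector.log",
    ]
    for path, hit in scan_logs(log_patterns).items():
        print(path, "->", hit if hit else "no ERROR/FATAL found")
```

Because the logs also contain entries from before the restart, the sketch is most useful on a freshly rotated log, or on only the lines written after the restart timestamp.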

 

 


Re: metrics collector crash after moving to distributed mode and after metrics cleaning

Dear Jay,

Here is one of the logs:

 


tail -f ambari-metrics-collector.log

,612 INFO org.apache.helix.monitoring.mbeans.ClusterStatusMonitor: Unregistering ClusterStatus: cluster=ambari-metrics-cluster,instanceName=master02.sys67.com_12001,resourceName=METRIC_AGGREGATORS
2020-02-25 17:55:36,612 INFO org.apache.helix.monitoring.mbeans.ClusterStatusMonitor: Unregistering ClusterStatus: cluster=ambari-metrics-cluster
2020-02-25 17:55:36,612 INFO org.apache.helix.manager.zk.CallbackHandler: 117 END:INVOKE /ambari-metrics-cluster/CONTROLLER listener:org.apache.helix.manager.zk.DistributedLeaderElection Took: 4ms
2020-02-25 17:55:36,612 INFO org.apache.helix.manager.zk.ZkClient: Closing zkclient: State:CONNECTED Timeout:30000 sessionid:0x1707d6c38190018 local:/43.6.53.28:45161 remoteserver:master01.sys67.com/43.6.53.27:2181 lastZxid:12249246728803 xid:533 sent:537 recv:547 queuedpkts:1 pendingresp:0 queuedevents:0
2020-02-25 17:55:36,613 WARN org.apache.helix.manager.zk.CallbackHandler: Skip processing callbacks for listener: org.apache.helix.controller.GenericHelixController@61dde151, path: /ambari-metrics-cluster/LIVEINSTANCES, expected types: [INIT] but was CALLBACK
2020-02-25 17:55:36,629 INFO org.apache.helix.controller.GenericHelixController: Get FINALIZE notification, skip the pipeline. Event :idealStateChange
2020-02-25 17:55:36,643 INFO org.apache.zookeeper.ZooKeeper: Session: 0x1707d6c38190018 closed
2020-02-25 17:55:36,643 INFO org.apache.helix.manager.zk.ZkClient: Closed zkclient
2020-02-25 17:55:36,643 INFO org.apache.helix.manager.zk.ZKHelixManager: Cluster manager: master02.sys67.com disconnected
2020-02-25 17:55:36,643 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down for session: 0x1707d6c38190018
2020-02-25 17:55:36,646 WARN org.apache.helix.controller.GenericHelixController: ClusterEventProcessor interrupted
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.helix.controller.stages.ClusterEventBlockingQueue.take(ClusterEventBlockingQueue.java:85)
at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:594)
2020-02-25 17:55:36,647 INFO org.apache.helix.controller.GenericHelixController: END ClusterEventProcessor thread
2020-02-25 17:55:36,671 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer at master02.sys67.com/43.6.53.28
************************************************************/

 

Michael-Bronson

Re: metrics collector crash after moving to distributed mode and after metrics cleaning

 

From hbase-ams-master-master02:

 

 

Tue Feb 25 18:07:05 UTC 2020, null, java.net.SocketTimeoutException: callTimeout=300000, callDuration=310495: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master02.sys67.com,61320,1582652433392, seqNum=0

at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:327)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:302)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:167)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:162)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:184)
at org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:136)
at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:238)
at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:118)
at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: callTimeout=300000, callDuration=310495: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master02.sys67.com,61320,1582652433392, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
... 3 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupConnection(RpcClientImpl.java:410)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:716)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:887)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:856)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1199)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32831)
at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:379)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:201)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:63)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:338)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
... 4 more
2020-02-25 18:07:05,670 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /ams-hbase-unsecure/splitWAL/RESCAN0000000390 entered state: DONE master02.sys67.com,61300,1582652432230
2020-02-25 18:07:06,545 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /ams-hbase-unsecure/splitWAL/RESCAN0000000391 entered state: DONE master02.sys67.com,61300,1582652432230
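
Reading the trace above from the bottom up, the innermost "Caused by" is java.net.ConnectException: Connection refused, raised while the master tried to scan hbase:meta on master02.sys67.com port 61320. For long traces, a small helper can pull out the innermost cause and the target server. This is a minimal Python sketch: the function names are hypothetical, and the regular expression only assumes the `hostname=<host>,<port>,<startcode>` form visible in the log.

```python
import re

CAUSE_RE = re.compile(r"^Caused by: (.+)$")
# HBase server names in the trace look like "hostname=<host>,<port>,<startcode>"
SERVER_RE = re.compile(r"hostname=([^,\s]+),(\d+),\d+")


def root_cause(trace_lines):
    """Return the text of the last 'Caused by:' line, or None if there is none."""
    cause = None
    for line in trace_lines:
        match = CAUSE_RE.match(line.strip())
        if match:
            cause = match.group(1)
    return cause


def target_server(trace_lines):
    """Return (host, port) of the first server named in the trace, or None."""
    for line in trace_lines:
        match = SERVER_RE.search(line)
        if match:
            return match.group(1), int(match.group(2))
    return None
```

If nothing is listening on the reported host and port, the hbase-ams-region-*.log on that host is probably the next place to look.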

Michael-Bronson

Re: metrics collector crash after moving to distributed mode and after metrics cleaning

Another thing:

 

ls -ltr /var/lib/ambari-metrics-collector/hbase-tmp/
total 0

 

The folder is empty (since we removed the data from this folder during cleaning).

Michael-Bronson