
ambari + metrics collector fails


Issue: the Metrics Collector fails after some time.

We performed the following steps, but they did not resolve the problem:

1. Tuned the metrics configuration per https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning

2. Backed up /var/lib/ambari-metrics-collector/hbase-tmp

3. Removed everything under /var/lib/ambari-metrics-collector/hbase-tmp

4. Started the Metrics Collector (a shell sketch of steps 2-4 follows below)
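
For reference, a minimal sketch of steps 2-4 as shell commands (we run AMS in embedded mode with the default hbase-tmp path; the collector start command may differ per install):

# back up the AMS HBase tmp dir before touching it
cp -rp /var/lib/ambari-metrics-collector/hbase-tmp /var/lib/ambari-metrics-collector/hbase-tmp.bak

# clear everything under it
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/*

# start the collector again as the ams user (or via the Ambari UI)
su - ams -c 'ambari-metrics-collector --config /etc/ambari-metrics-collector/conf start'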


Any other suggestions for what we can try next?



Logs are attached to this thread.



tail -f hbase-ams-master-master02.log
2018-06-04 13:17:22,377 INFO [M:0;master02:35537] regionserver.HRegionServer: ClusterId : 78d0cdb7-07a1-4153-8b79-642b783966e4
2018-06-04 13:17:28,229 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:38983
2018-06-04 13:17:28,234 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:38983
2018-06-04 13:17:28,239 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0005 with negotiated timeout 120000 for client /23.12.4.55:38983
2018-06-04 13:17:38,972 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x163caf1940f0005, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2018-06-04 13:17:38,974 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: Closed socket connection for client /23.12.4.55:38983 which had sessionid 0x163caf1940f0005
2018-06-04 13:17:47,949 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:39191
2018-06-04 13:17:47,952 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:39191
2018-06-04 13:17:47,956 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0006 with negotiated timeout 120000 for client /23.12.4.55:39191
2018-06-04 13:17:58,642 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x163caf1940f0006, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:745)
2018-06-04 13:17:58,644 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: Closed socket connection for client /23.12.4.55:39191 which had sessionid 0x163caf1940f0006
2018-06-04 13:18:07,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:39264
2018-06-04 13:18:07,587 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:39264
2018-06-04 13:18:07,590 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0007 with negotiated timeout 120000 for client /23.12.4.55:39264
2018-06-04 13:18:10,783 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
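
The pattern above (sessions opened against the collector's embedded ZooKeeper on port 61181 and dropped seconds later) repeats until the sink reports no live collector. One quick way to check whether that ZooKeeper is still responsive (a sketch, assuming nc is installed on the collector host):

# ZooKeeper four-letter-word health check; a healthy server replies "imok"
echo ruok | nc localhost 61181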
Michael-Bronson
12 REPLIES

Re: ambari + metrics collector fails

Mentor

@Michael Bronson

The following steps will help you clean up Ambari Metrics System (AMS) data in a given cluster.

Important Note:

  1. Cleaning up the AMS data removes all available historical AMS data
  2. The HBase parameters mentioned below are specific to AMS and are different from the cluster HBase parameters

Step-by-step guide

  1. Using Ambari
    1. Set AMS to maintenance
    2. Stop AMS from Ambari
    3. Identify the following from the AMS Configs screen
      1. 'Metrics Service operation mode' (embedded or distributed)
      2. hbase.rootdir
      3. hbase.zookeeper.property.dataDir
  2. AMS data is stored in the 'hbase.rootdir' identified above. Back up and remove the AMS data.
    1. If the Metrics Service operation mode
      1. is 'embedded', then the data is stored in OS files. Use regular OS commands to backup and remove the files in hbase.rootdir
      2. is 'distributed', then the data is stored in HDFS. Use 'hdfs dfs' commands to backup and remove the files in hbase.rootdir
  3. Remove the AMS zookeeper data by backing up and removing the contents of 'hbase.tmp.dir'/zookeeper
  4. Remove any Phoenix spool files from 'hbase.tmp.dir'/phoenix-spool folder
  5. Restart AMS using Ambari (a shell sketch of the embedded-mode case follows this list)
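
For the embedded-mode case, the guide above boils down to roughly the following shell sketch (paths are the defaults reported later in this thread; substitute your own hbase.rootdir and hbase.tmp.dir values, and run this only while AMS is stopped):

# 2) back up, then remove, the AMS HBase data (embedded mode: plain OS files)
cp -rp /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase.bak
rm -rf /var/lib/ambari-metrics-collector/hbase/*

# 3) back up, then remove, the AMS ZooKeeper data under hbase.tmp.dir
cp -rp /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper /var/lib/ambari-metrics-collector/zookeeper.bak
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper

# 4) drop any Phoenix spool files
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool

# 5) then restart AMS from the Ambari UI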

Hope that helps!

Re: ambari + metrics collector fails

@Geoffrey

I think in my case the Metrics Service operation mode is 'embedded', and hbase.rootdir is the folder /var/var/lib/ambari-metrics-collector/hbase


So do I need to remove everything under the following paths?

/var/var/lib/ambari-metrics-collector/hbase

/var/lib/ambari-metrics-collector/hbase-tmp/zookeeper

/var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool/

ls -ltr /var/var/lib/ambari-metrics-collector/hbase
total 32
-rw-r--r--.  1 ams hadoop     7 Jul 13  2017 hbase.version
-rw-r--r--.  1 ams hadoop    42 Jul 13  2017 hbase.id
drwxr-xr-x.  4 ams hadoop    32 Jul 13  2017 data
drwxr-xr-x   2 ams hadoop     6 Jul 16  2017 corrupt
drwxr-xr-x. 34 ams hadoop 24576 Jun  4 13:20 WALs
drwxr-xr-x.  2 ams hadoop    43 Jun  4 13:34 MasterProcWALs
drwxr-xr-x   2 ams hadoop     6 Jun  4 13:36 archive
drwxr-xr-x.  2 ams hadoop     6 Jun  4 13:43 oldWALs
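
To double-check hbase.rootdir and hbase.tmp.dir before deleting anything, the values can be read straight from the AMS HBase config (a sketch; on HDP the conf dir is commonly /etc/ams-hbase/conf, but the location may differ per install):

# print each property name plus the value line that follows it
grep -A1 hbase.rootdir /etc/ams-hbase/conf/hbase-site.xml
grep -A1 hbase.tmp.dir /etc/ams-hbase/conf/hbase-site.xml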
Michael-Bronson

Re: ambari + metrics collector fails

Mentor

@Michael Bronson

What's your cluster size?

Re: ambari + metrics collector fails

3 master machines and 5 worker machines; ZooKeeper runs on the masters.

Michael-Bronson

Re: ambari + metrics collector fails

BTW, ignore my previous remark (since deleted) about hbase_master_heapsize; it didn't help us.

Michael-Bronson

Re: ambari + metrics collector fails

Mentor

@Michael Bronson

I looked for the thread in vain :-)

Here is a document to help you: Tuning AMS - https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning (some of the knobs it covers are sketched below).
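
By way of illustration only, the heap-related settings that document covers live in ams-env and ams-hbase-env; the values below are placeholders, so take the real numbers from the sizing tables there for your node count:

# ams-env (illustrative value only)
metrics_collector_heapsize=1024

# ams-hbase-env, embedded mode (illustrative values only)
hbase_master_heapsize=1024
hbase_master_xmn_size=256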

Re: ambari + metrics collector fails

Yes, we already performed that :-) (see step 1 in my question - tuning the metrics configuration per https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning)

Michael-Bronson

Re: ambari + metrics collector fails

Now I see these warnings, which I hadn't seen before (after cleaning everything under those folders):


2018-06-04 15:03:39,739 WARN [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=48232] io.FSDataInputStreamWrapper: Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream . So there may be a TCP socket connection left open in CLOSE_WAIT state.
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hbase.io.FSDataInputStreamWrapper.unbuffer(FSDataInputStreamWrapper.java:263)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.unbufferStream(HFileBlock.java:1788)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.unbufferStream(HFileReaderV2.java:1403)
        at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.close(AbstractHFileReader.java:343)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.close(StoreFileScanner.java:252)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.close(KeyValueHeap.java:222)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.close(StoreScanner.java:449)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.close(KeyValueHeap.java:217)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.close(HRegion.java:6198)
        at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$2.close(BaseScannerRegionObserver.java:371)
        at org.apache.phoenix.coprocessor.HashJoinRegionScanner.close(HashJoinRegionScanner.java:296)
        at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$1.close(BaseScannerRegionObserver.java:244)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.closeScanner(RSRpcServices.java:2717)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2674)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.UnsupportedOperationException: this stream does not support unbuffering.
        at org.apache.hadoop.fs.FSDataInputStream.unbuffer(FSDataInputStream.java:233)
        ... 22 more
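
Whether connections are actually piling up in CLOSE_WAIT, as the warning suggests, can be checked on the collector host (a sketch, assuming net-tools is installed):

# count sockets stuck in CLOSE_WAIT (state is column 6 of netstat -ant output)
netstat -ant | awk '$6 == "CLOSE_WAIT"' | wc -l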
Michael-Bronson

Re: ambari + metrics collector fails

@Geoffrey I also found this: https://issues.apache.org/jira/browse/HADOOP-14864?attachmentSortBy=dateTime

Is it something related to our problem?

Michael-Bronson