
Ambari + Metrics Collector fails

Issue: the Metrics Collector fails after some time.

We performed the following steps, but they did not resolve the problem (a rough sketch of the commands follows the list):

1. Tuned the metrics configuration per https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning

2. Backed up /var/lib/ambari-metrics-collector/hbase-tmp

3. Removed everything under /var/lib/ambari-metrics-collector/hbase-tmp

4. Started the Metrics Collector
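
A rough sketch of what we ran for steps 2-4 (the backup destination is just an example path; the Metrics Collector itself was stopped and started from the Ambari UI):

# with the Metrics Collector stopped from Ambari, on the collector host:
cp -a /var/lib/ambari-metrics-collector/hbase-tmp /var/lib/ambari-metrics-collector/hbase-tmp.bak   # example backup location
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/*                                                # clear the AMS hbase-tmp dir
# then start the Metrics Collector from Ambari again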


Any other suggestions on what we can do next?



Logs are attached to this thread.



tail -f hbase-ams-master-master02.log

2018-06-04 13:17:22,377 INFO [M:0;master02:35537] regionserver.HRegionServer: ClusterId : 78d0cdb7-07a1-4153-8b79-642b783966e4
2018-06-04 13:17:28,229 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:38983
2018-06-04 13:17:28,234 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:38983
2018-06-04 13:17:28,239 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0005 with negotiated timeout 120000 for client /23.12.4.55:38983
2018-06-04 13:17:38,972 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x163caf1940f0005, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2018-06-04 13:17:38,974 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: Closed socket connection for client /23.12.4.55:38983 which had sessionid 0x163caf1940f0005
2018-06-04 13:17:47,949 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:39191
2018-06-04 13:17:47,952 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:39191
2018-06-04 13:17:47,956 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0006 with negotiated timeout 120000 for client /23.12.4.55:39191
2018-06-04 13:17:58,642 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x163caf1940f0006, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2018-06-04 13:17:58,644 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: Closed socket connection for client /23.12.4.55:39191 which had sessionid 0x163caf1940f0006
2018-06-04 13:18:07,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:39264
2018-06-04 13:18:07,587 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:39264
2018-06-04 13:18:07,590 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0007 with negotiated timeout 120000 for client /23.12.4.55:39264
2018-06-04 13:18:10,783 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
Michael-Bronson
12 REPLIES

Mentor

@Michael Bronson

The following steps will help you clean up Ambari Metrics System (AMS) data in a given cluster (a shell sketch of the embedded-mode cleanup follows the guide).

Important Note:

  1. Cleaning up the AMS data removes all the historical AMS data available
  2. The HBase parameters mentioned below are specific to AMS and are different from the cluster HBase parameters

Step-by-step guide

  1. Using Ambari
    1. Set AMS to maintenance
    2. Stop AMS from Ambari
    3. Identify the following from the AMS Configs screen
      1. 'Metrics Service operation mode' (embedded or distributed)
      2. hbase.rootdir
      3. hbase.zookeeper.property.dataDir
  2. AMS data is stored in the 'hbase.rootdir' identified above. Back up and remove the AMS data.
    1. If the Metrics Service operation mode
      1. is 'embedded', then the data is stored in OS files. Use regular OS commands to back up and remove the files in hbase.rootdir
      2. is 'distributed', then the data is stored in HDFS. Use 'hdfs dfs' commands to back up and remove the files in hbase.rootdir
  3. Remove the AMS zookeeper data by backing up and removing the contents of 'hbase.tmp.dir'/zookeeper
  4. Remove any Phoenix spool files from 'hbase.tmp.dir'/phoenix-spool folder
  5. Restart AMS using Ambari
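
For the 'embedded' case, a minimal shell sketch of steps 2-4, assuming the default AMS local directories (substitute your own hbase.rootdir and hbase.tmp.dir values from the AMS configs, and pick your own backup location):

# with AMS stopped from Ambari, on the Metrics Collector host:
cp -a /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase.bak   # backup of hbase.rootdir (example destination)
rm -rf /var/lib/ambari-metrics-collector/hbase/*                                            # remove the AMS HBase data
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/*                              # AMS zookeeper data under hbase.tmp.dir
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool/*                          # Phoenix spool files
# for the 'distributed' case use HDFS commands instead, e.g. hdfs dfs -rm -r <hbase.rootdir>/*
# then restart AMS from Ambari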

Hope that helps !

@Geoffrey

I think in my case the Metrics Service operation mode is 'embedded', and hbase.rootdir is the folder /var/var/lib/ambari-metrics-collector/hbase.


So do I need to remove everything under the following?

/var/var/lib/ambari-metrics-collector/hbase

/var/lib/ambari-metrics-collector/hbase-tmp/zookeeper

/var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool/

ls -ltr /var/var/lib/ambari-metrics-collector/hbase
total 32
-rw-r--r--.  1 ams hadoop     7 Jul 13  2017 hbase.version
-rw-r--r--.  1 ams hadoop    42 Jul 13  2017 hbase.id
drwxr-xr-x.  4 ams hadoop    32 Jul 13  2017 data
drwxr-xr-x   2 ams hadoop     6 Jul 16  2017 corrupt
drwxr-xr-x. 34 ams hadoop 24576 Jun  4 13:20 WALs
drwxr-xr-x.  2 ams hadoop    43 Jun  4 13:34 MasterProcWALs
drwxr-xr-x   2 ams hadoop     6 Jun  4 13:36 archive
drwxr-xr-x.  2 ams hadoop     6 Jun  4 13:43 oldWALs
Michael-Bronson

Mentor

@Michael Bronson

What's your cluster size?

3 master machines and 5 worker machines; the ZooKeeper servers run on the masters.

Michael-Bronson

BTW, ignore my previous remark (which was deleted) about hbase_master_heapsize; it did not help us.

Michael-Bronson

Mentor

@Michael Bronson

I looked for the thread in vain 🙂

Here is a document to help you: Tuning AMS

Yes, we already performed that :-) (see step 1 in my question: tuning the metrics configuration per https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning)

Michael-Bronson

Now I see these warnings, which we did not see before (after cleaning everything under the folders):


2018-06-04 15:03:39,739 WARN [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=48232] io.FSDataInputStreamWrapper: Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream . So there may be a TCP socket connection left open in CLOSE_WAIT state.
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.io.FSDataInputStreamWrapper.unbuffer(FSDataInputStreamWrapper.java:263)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.unbufferStream(HFileBlock.java:1788)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.unbufferStream(HFileReaderV2.java:1403)
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.close(AbstractHFileReader.java:343)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.close(StoreFileScanner.java:252)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.close(KeyValueHeap.java:222)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.close(StoreScanner.java:449)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.close(KeyValueHeap.java:217)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.close(HRegion.java:6198)
    at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$2.close(BaseScannerRegionObserver.java:371)
    at org.apache.phoenix.coprocessor.HashJoinRegionScanner.close(HashJoinRegionScanner.java:296)
    at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$1.close(BaseScannerRegionObserver.java:244)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.closeScanner(RSRpcServices.java:2717)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2674)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.UnsupportedOperationException: this stream does not support unbuffering.
    at org.apache.hadoop.fs.FSDataInputStream.unbuffer(FSDataInputStream.java:233)
    ... 22 more
Michael-Bronson

@Geoffrey I also found this: https://issues.apache.org/jira/browse/HADOOP-14864?attachmentSortBy=dateTime

Is it something related to our problem?

Michael-Bronson

After the Metrics Collector had been up for one hour, it is now down again and we see these logs:


at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: this stream does not support unbuffering.
at org.apache.hadoop.fs.FSDataInputStream.unbuffer(FSDataInputStream.java:233)
... 29 more
2018-06-04 16:19:26,938 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
Michael-Bronson

Expert Contributor

Michael Bronson

This is a known issue in the HBase version used by AMS in Ambari 2.6.1. Please downgrade the AMS version to 2.6.0 using the following steps.

  • Update the ambari.repo file on the Metrics Collector host to point to the 2.6.0.0 release
  • yum clean all
  • Stop AMS.
  • yum remove ambari-metrics-collector
  • yum install ambari-metrics-collector
  • Verify the version of the AMS jars: /usr/lib/ambari-metrics-collector/ambari-metrics-*.jar
  • Start AMS.
  • Update the repo file back to the 2.6.1 version so that we don't disturb Ambari's setup.

There were minimal changes in AMS from 2.6.0 to 2.6.1. You can also bring back the 2.6.1 versions of the ambari-metrics-* jars in /usr/lib/ambari-metrics-collector after the yum downgrade, i.e., use the newest version of the AMS jars with the older version of HBase. A rough shell sketch of the downgrade is below.
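
A minimal sketch of that sequence on the Metrics Collector host, assuming a yum-based host where the repo file is /etc/yum.repos.d/ambari.repo (AMS stop/start is done from the Ambari UI):

# edit /etc/yum.repos.d/ambari.repo so the baseurl points at the 2.6.0.0 release, then:
yum clean all
# stop AMS from Ambari, then:
yum remove -y ambari-metrics-collector
yum install -y ambari-metrics-collector
ls -l /usr/lib/ambari-metrics-collector/ambari-metrics-*.jar   # verify the AMS jar versions
# start AMS from Ambari, then point ambari.repo back at the 2.6.1 release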

@Aravindan Vijayan

Do you mean this is the known issue behind the warning that I get: "WARN [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=48232] io.FSDataInputStreamWrapper: Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream . So there may be a TCP socket connection left open in CLOSE_WAIT state."?
Michael-Bronson