Created 06-04-2018 12:51 PM
Issue: the Metrics Collector fails after some time.
We performed the following steps (but they did not resolve the problem):
1. Tuned the metrics configuration per https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning
2. Backed up /var/lib/ambari-metrics-collector/hbase-tmp
3. Removed everything under /var/lib/ambari-metrics-collector/hbase-tmp
4. Started the Metrics Collector
Any other suggestions for what we can try next?
Logs are attached to this thread.
tail -f hbase-ams-master-master02.log
2018-06-04 13:17:22,377 INFO [M:0;master02:35537] regionserver.HRegionServer: ClusterId : 78d0cdb7-07a1-4153-8b79-642b783966e4
2018-06-04 13:17:28,229 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:38983
2018-06-04 13:17:28,234 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:38983
2018-06-04 13:17:28,239 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0005 with negotiated timeout 120000 for client /23.12.4.55:38983
2018-06-04 13:17:38,972 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x163caf1940f0005, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2018-06-04 13:17:38,974 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: Closed socket connection for client /23.12.4.55:38983 which had sessionid 0x163caf1940f0005
2018-06-04 13:17:47,949 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:39191
2018-06-04 13:17:47,952 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:39191
2018-06-04 13:17:47,956 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0006 with negotiated timeout 120000 for client /23.12.4.55:39191
2018-06-04 13:17:58,642 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x163caf1940f0006, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:745)
2018-06-04 13:17:58,644 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxn: Closed socket connection for client /23.12.4.55:39191 which had sessionid 0x163caf1940f0006
2018-06-04 13:18:07,584 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.NIOServerCnxnFactory: Accepted socket connection from /23.12.4.55:39264
2018-06-04 13:18:07,587 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:61181] server.ZooKeeperServer: Client attempting to establish new session at /23.12.4.55:39264
2018-06-04 13:18:07,590 INFO [SyncThread:0] server.ZooKeeperServer: Established session 0x163caf1940f0007 with negotiated timeout 120000 for client /23.12.4.55:39264
2018-06-04 13:18:10,783 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
Created 06-04-2018 01:20 PM
The following steps should help you clean up Ambari Metrics System data in a given cluster.
Important Note:
Step-by-step guide
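For an embedded-mode collector, a minimal sketch of what that cleanup boils down to, assuming the default data directories and that the ams:hadoop ownership shown in your ls output applies (confirm hbase.rootdir and hbase.tmp.dir in your AMS HBase config before deleting anything):

# 1. Stop Ambari Metrics from the Ambari UI (Ambari Metrics > Service Actions > Stop), then on the collector host:
# 2. Back up, then recreate, the embedded HBase data and tmp dirs (default paths; adjust to your hbase.rootdir / hbase.tmp.dir)
mv /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase.bak
mv /var/lib/ambari-metrics-collector/hbase-tmp /var/lib/ambari-metrics-collector/hbase-tmp.bak
mkdir -p /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase-tmp
chown ams:hadoop /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase-tmp
# 3. Start Ambari Metrics again from Ambari and watch ambari-metrics-collector.log / hbase-ams-master-*.log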
Hope that helps !
Created 06-04-2018 01:58 PM
I think in my case the service operation mode is embedded - hbase.rootdir points to the folder /var/var/lib/ambari-metrics-collector/hbase
So do I need to remove everything under the following?
/var/var/lib/ambari-metrics-collector/hbase
/var/lib/ambari-metrics-collector/hbase-tmp/zookeeper
/var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool/
ls -ltr /var/var/lib/ambari-metrics-collector/hbase
total 32
-rw-r--r--.  1 ams hadoop     7 Jul 13  2017 hbase.version
-rw-r--r--.  1 ams hadoop    42 Jul 13  2017 hbase.id
drwxr-xr-x.  4 ams hadoop    32 Jul 13  2017 data
drwxr-xr-x   2 ams hadoop     6 Jul 16  2017 corrupt
drwxr-xr-x. 34 ams hadoop 24576 Jun  4 13:20 WALs
drwxr-xr-x.  2 ams hadoop    43 Jun  4 13:34 MasterProcWALs
drwxr-xr-x   2 ams hadoop     6 Jun  4 13:36 archive
drwxr-xr-x.  2 ams hadoop     6 Jun  4 13:43 oldWALs
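One hedged way to confirm the operation mode and data directories before deleting anything (the config paths below assume a default AMS layout; your conf dirs may differ):

# Embedded vs. distributed mode
grep -A1 'timeline.metrics.service.operation.mode' /etc/ambari-metrics-collector/conf/ams-site.xml
# Where the embedded HBase keeps its data and tmp dirs
grep -A1 -E 'hbase.rootdir|hbase.tmp.dir' /etc/ams-hbase/conf/hbase-site.xml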
Created 06-04-2018 01:51 PM
What's your cluster size?
Created 06-04-2018 02:03 PM
3 master machines and 5 worker machines; ZooKeeper runs on the masters.
Created 06-04-2018 02:06 PM
BTW, ignore my previous remark (which was deleted) about hbase_master_heapsize; it did not help us.
Created 06-04-2018 02:21 PM
Created 06-04-2018 02:25 PM
Yes, we already performed that :-) (see step 1 in my question - tuning the metrics configuration per https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning).
Created 06-04-2018 03:09 PM
Now I see this warning, which I had not seen before (after cleaning everything under those folders):
2018-06-04 15:03:39,739 WARN [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=48232] io.FSDataInputStreamWrapper: Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream . So there may be a TCP socket connection left open in CLOSE_WAIT state.
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.io.FSDataInputStreamWrapper.unbuffer(FSDataInputStreamWrapper.java:263)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.unbufferStream(HFileBlock.java:1788)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.unbufferStream(HFileReaderV2.java:1403)
    at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.close(AbstractHFileReader.java:343)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.close(StoreFileScanner.java:252)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.close(KeyValueHeap.java:222)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.close(StoreScanner.java:449)
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.close(KeyValueHeap.java:217)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.close(HRegion.java:6198)
    at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$2.close(BaseScannerRegionObserver.java:371)
    at org.apache.phoenix.coprocessor.HashJoinRegionScanner.close(HashJoinRegionScanner.java:296)
    at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$1.close(BaseScannerRegionObserver.java:244)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.closeScanner(RSRpcServices.java:2717)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2674)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Caused by: java.lang.UnsupportedOperationException: this stream does not support unbuffering.
    at org.apache.hadoop.fs.FSDataInputStream.unbuffer(FSDataInputStream.java:233)
    ... 22 more
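The warning only says a socket may be left in CLOSE_WAIT; whether sockets are actually piling up is something you can check on the collector host. A quick sketch with ss (assuming iproute2 is installed; 48232 is the RPC port taken from the warning above):

# Count all sockets stuck in CLOSE_WAIT
ss -tan state close-wait | wc -l
# Narrow it to the port seen in the warning
ss -tan state close-wait '( sport = :48232 or dport = :48232 )'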
Created 06-04-2018 04:22 PM
@Geoffrey I also found this - https://issues.apache.org/jira/browse/HADOOP-14864?attachmentSortBy=dateTime
Is it related to our problem?
Created 06-04-2018 04:34 PM
After the Metrics Collector had been up for about an hour, it went down again, and we see these logs:
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: this stream does not support unbuffering.
at org.apache.hadoop.fs.FSDataInputStream.unbuffer(FSDataInputStream.java:233)
... 29 more
2018-06-04 16:19:26,938 INFO [timeline] timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20 times.
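When the sinks start logging "No live collector to send metrics to", a quick way to check whether the collector process and its API port are actually up on the collector host (6188 is assumed to be the default collector port; adjust if yours differs):

# Is the collector process still running?
ps -ef | grep '[a]mbari-metrics-collector'
# Does the collector answer on its web port?
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:6188/ws/v1/timeline/metrics/metadata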
Created 06-04-2018 09:25 PM
This is a known issue in the HBase version used by AMS in Ambari 2.6.1. Please downgrade AMS to version 2.6.0.
There were minimal changes in AMS from 2.6.0 to 2.6.1, so you can also restore the 2.6.1 versions of the ambari-metrics-* jars in /usr/lib/ambari-metrics-collector after the yum downgrade - in other words, run the newest AMS jars on top of the older bundled HBase.
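A rough sketch of the downgrade on the collector host, assuming a yum-managed install and that the 2.6.0 packages are still available in your repository (check the exact version strings first; monitor/sink packages on other hosts may need the same treatment):

# Stop Ambari Metrics from the Ambari UI, then on the collector host:
yum list --showduplicates ambari-metrics-collector   # confirm a 2.6.0.x version is available
yum downgrade ambari-metrics-collector               # downgrades to the previous available version
# Start Ambari Metrics again from Ambari once the downgrade completes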
Created 06-05-2018 05:46 AM
Do you mean the known issue is the warning I am getting - "WARN [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=48232] io.FSDataInputStreamWrapper: Failed to invoke 'unbuffer' method in class class org.apache.hadoop.fs.FSDataInputStream . So there may be a TCP socket connection left open in CLOSE_WAIT state."?