Created 12-22-2016 09:14 AM
Ambari Metrics works intermittently. It works for a few minutes and then stops showing graphs. Sometimes the Metrics Collector simply stops; after a manual start it works, but within a few minutes it stops again. What is going wrong?
My settings
HDP 2.5
Ambari 2.4.0.1
No Kerberos
iptables off

hbase.zookeeper.property.tickTime = 6000
Metrics Service operation mode = distributed
hbase.cluster.distributed = true
hbase.zookeeper.property.clientPort = 2181
hbase.rootdir = hdfs://prodcluster/ams/hbase
Logs
ambari-metrics-collector.log
2016-12-22 11:59:23,030 ERROR org.mortbay.log: /ws/v1/timeline/metrics
javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException]
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:159)
    at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:895)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:843)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:804)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1294)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException]
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:325)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:249)
    at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:95)
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:179)
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157)
    ... 37 more
Caused by: org.mortbay.jetty.EofException
    at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:634)
    at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580)
    at com.sun.jersey.spi.container.servlet.WebComponent$Writer.write(WebComponent.java:307)
    at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.write(ContainerResponse.java:134)
    at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.flushBuffer(UTF8XmlOutput.java:416)
    at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.endDocument(UTF8XmlOutput.java:141)
    at com.sun.xml.bind.v2.runtime.XMLSerializer.endDocument(XMLSerializer.java:856)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.postwrite(MarshallerImpl.java:374)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:321)
    ... 41 more
2016-12-22 12:00:05,549 INFO TimelineClusterAggregatorMinute: 0 row(s) updated.
2016-12-22 12:00:19,105 INFO TimelineClusterAggregatorMinute: Aggregated cluster metrics for METRIC_AGGREGATE_MINUTE, with startTime = Thu Dec 22 11:50:00 MSK 2016, endTime = Thu Dec 22 11:55:00 MSK 2016
2016-12-22 12:00:19,111 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Thu Dec 22 12:00:19 MSK 2016
2016-12-22 12:00:19,111 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Thu Dec 22 12:00:19 MSK 2016
2016-12-22 12:00:24,077 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Thu Dec 22 12:00:24 MSK 2016
2016-12-22 12:00:24,083 INFO TimelineClusterAggregatorMinute: Last Checkpoint read : Thu Dec 22 11:55:00 MSK 2016
2016-12-22 12:00:24,083 INFO TimelineClusterAggregatorMinute: Rounded off checkpoint : Thu Dec 22 11:55:00 MSK 2016
2016-12-22 12:00:24,084 INFO TimelineClusterAggregatorMinute: Last check point time: 1482396900000, lagBy: 324 seconds.
2016-12-22 12:00:24,085 INFO TimelineClusterAggregatorMinute: Start aggregation cycle @ Thu Dec 22 12:00:24 MSK 2016, startTime = Thu Dec 22 11:55:00 MSK 2016, endTime = Thu Dec 22 12:00:00 MSK 2016
hbase-ams-master-hdp-nn2.hostname.log
2016-12-22 11:22:46,455 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.ZooKeeper: Session: 0x3570f2523bb3db4 closed
2016-12-22 11:22:46,455 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-22 11:23:31,207 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times
2016-12-22 11:23:31,208 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics
2016-12-22 11:23:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0
2016-12-22 11:23:46,460 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x12f8ba29 connecting to ZooKeeper ensemble=hdp-nn1.hostname:2181,hdp-dn1.hostname:2181,hdp-nn2.hostname:2181
2016-12-22 11:23:46,461 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.ZooKeeper: Initiating client connection, connectString=hdp-nn1.hostname:2181,hdp-dn1.hostname:2181,hdp-nn2.hostname:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@20fc295f
2016-12-22 11:23:46,464 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-SendThread(hdp-nn2.hostname:2181)] zookeeper.ClientCnxn: Opening socket connection to server hdp-nn2.hostname/10.255.242.181:2181. Will not attempt to authenticate using SASL (unknown error)
2016-12-22 11:23:46,466 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-SendThread(hdp-nn2.hostname:2181)] zookeeper.ClientCnxn: Socket connection established to hdp-nn2.hostname/10.255.242.181:2181, initiating session
2016-12-22 11:23:46,469 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-SendThread(hdp-nn2.hostname:2181)] zookeeper.ClientCnxn: Session establishment complete on server hdp-nn2.hostname/10.255.242.181:2181, sessionid = 0x3570f2523bb3db5, negotiated timeout = 40000
2016-12-22 11:23:46,495 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x3570f2523bb3db5
2016-12-22 11:23:46,499 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.ZooKeeper: Session: 0x3570f2523bb3db5 closed
2016-12-22 11:23:46,499 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-12-22 11:28:41,485 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=89, evicted=0, evictedPerRun=0.0
2016-12-22 11:29:49,178 INFO [WALProcedureStoreSyncThread] wal.WALProcedureStore: Remove log: hdfs://prodcluster/ams/hbase/MasterProcWALs/state-00000000000000000001.log
2016-12-22 11:29:49,180 INFO [WALProcedureStoreSyncThread] wal.WALProcedureStore: Removed logs: [hdfs://prodcluster/ams/hbase/MasterProcWALs/state-00000000000000000002.log]
2016-12-22 11:33:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=119, evicted=0, evictedPerRun=0.0
2016-12-22 11:38:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=149, evicted=0, evictedPerRun=0.0
2016-12-22 11:43:41,485 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=179, evicted=0, evictedPerRun=0.0
2016-12-22 11:44:51,222 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times
2016-12-22 11:44:51,223 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics
2016-12-22 11:48:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=209, evicted=0, evictedPerRun=0.0
2016-12-22 11:50:51,205 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times
2016-12-22 11:50:51,205 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics
I removed the folder /var/lib/ambari-metrics-collector/hbase-tmp and restarted AMS, as recommended here: https://community.hortonworks.com/articles/11805/how-to-solve-ambari-metrics-corrupted-data.html, but it did not help.
Created 12-27-2016 05:00 AM
Yesterday Ambari Metrics suddenly started working (and it still works). The only thing I changed yesterday was installing Apache Atlas, which required restarting almost all components; maybe that helped. Thanks for your assistance!
Created on 11-04-2019 11:36 PM - edited 11-05-2019 01:03 AM
You need to stop the Ambari Metrics service via Ambari and then remove all temp files. Go to the Ambari Metrics Collector host and execute the command below.
mv /var/lib/ambari-metrics-collector /tmp/ambari-metrics-collector_OLD
Now you can restart the AMS service, and Ambari Metrics should work again.
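If you want to script the move step, here is a minimal sketch. It assumes the Ambari Metrics service has already been stopped in Ambari. The function name move_aside and the timestamp suffix are my own additions for illustration (the command above uses a fixed _OLD name); the timestamp just avoids overwriting an earlier backup.

```shell
#!/bin/sh
# Sketch only: move a directory aside instead of deleting it.
# Assumption: the Ambari Metrics service is already stopped via Ambari.
# move_aside is a hypothetical helper name, not part of Ambari.

# move_aside SRC DEST_PARENT
# Moves SRC to DEST_PARENT/<basename of SRC>_OLD_<timestamp>
# and prints the new location on success.
move_aside() {
    src=$1
    dest_parent=$2
    if [ ! -d "$src" ]; then
        echo "move_aside: $src is not a directory" >&2
        return 1
    fi
    dest="$dest_parent/$(basename "$src")_OLD_$(date +%Y%m%d%H%M%S)"
    if [ -e "$dest" ]; then
        echo "move_aside: $dest already exists" >&2
        return 1
    fi
    mv "$src" "$dest" && echo "$dest"
}

# Example call on the Metrics Collector host (left commented out on purpose):
# move_aside /var/lib/ambari-metrics-collector /tmp
```

Keeping the old directory around means you can move it back if the restart does not help.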
Created 02-20-2018 07:25 AM
I am having the same problem. Here is ambari-metrics-collector.log.