Created 12-22-2016 09:14 AM
Ambari Metrics works intermittently. It works for a few minutes and then stops showing graphs. Sometimes the Metrics Collector simply stops; after a manual start it works again, but within a few minutes it stops once more. What is going wrong?
My settings
HDP 2.5, Ambari 2.4.0.1, no Kerberos, iptables off

hbase.zookeeper.property.tickTime = 6000
Metrics Service operation mode = distributed
hbase.cluster.distributed = true
hbase.zookeeper.property.clientPort = 2181
hbase.rootdir = hdfs://prodcluster/ams/hbase
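For the record, the "manual start" I do is roughly the following (a sketch only; it assumes the standard AMS collector script and port on the collector host hdp-nn2):

# on the Metrics Collector host
ambari-metrics-collector status     # or stop/start; same effect as Ambari's Service Actions
ambari-metrics-collector start
# quick check that the collector web endpoint answers
curl -s -o /dev/null -w "%{http_code}\n" "http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics/metadata"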
Logs
ambari-metrics-collector.log 2016-12-22 11:59:23,030 ERROR org.mortbay.log: /ws/v1/timeline/metrics javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException] at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:159) at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:895) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:843) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:804) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1294) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: javax.xml.bind.MarshalException - with linked exception: 
[org.mortbay.jetty.EofException] at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:325) at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:249) at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:95) at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:179) at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157) ... 37 more Caused by: org.mortbay.jetty.EofException at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:634) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at com.sun.jersey.spi.container.servlet.WebComponent$Writer.write(WebComponent.java:307) at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.write(ContainerResponse.java:134) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.flushBuffer(UTF8XmlOutput.java:416) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.endDocument(UTF8XmlOutput.java:141) at com.sun.xml.bind.v2.runtime.XMLSerializer.endDocument(XMLSerializer.java:856) at com.sun.xml.bind.v2.runtime.MarshallerImpl.postwrite(MarshallerImpl.java:374) at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:321) ... 41 more 2016-12-22 12:00:05,549 INFO TimelineClusterAggregatorMinute: 0 row(s) updated. 2016-12-22 12:00:19,105 INFO TimelineClusterAggregatorMinute: Aggregated cluster metrics for METRIC_AGGREGATE_MINUTE, with startTime = Thu Dec 22 11:50:00 MSK 2016, endTime = Thu Dec 22 11:55:00 MSK 2016 2016-12-22 12:00:19,111 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Thu Dec 22 12:00:19 MSK 2016 2016-12-22 12:00:19,111 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Thu Dec 22 12:00:19 MSK 2016 2016-12-22 12:00:24,077 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Thu Dec 22 12:00:24 MSK 2016 2016-12-22 12:00:24,083 INFO TimelineClusterAggregatorMinute: Last Checkpoint read : Thu Dec 22 11:55:00 MSK 2016 2016-12-22 12:00:24,083 INFO TimelineClusterAggregatorMinute: Rounded off checkpoint : Thu Dec 22 11:55:00 MSK 2016 2016-12-22 12:00:24,084 INFO TimelineClusterAggregatorMinute: Last check point time: 1482396900000, lagBy: 324 seconds. 2016-12-22 12:00:24,085 INFO TimelineClusterAggregatorMinute: Start aggregation cycle @ Thu Dec 22 12:00:24 MSK 2016, startTime = Thu Dec 22 11:55:00 MSK 2016, endTime = Thu Dec 22 12:00:00 MSK 2016
hbase-ams-master-hdp-nn2.hostname.log 2016-12-22 11:22:46,455 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.ZooKeeper: Session: 0x3570f2523bb3db4 closed 2016-12-22 11:22:46,455 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-EventThread] zookeeper.ClientCnxn: EventThread shut down 2016-12-22 11:23:31,207 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times 2016-12-22 11:23:31,208 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics 2016-12-22 11:23:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0 2016-12-22 11:23:46,460 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x12f8ba29 connecting to ZooKeeper ensemble=hdp-nn1.hostname:2181,hdp-dn1.hostname:2181,hdp-nn2.hostname:2181 2016-12-22 11:23:46,461 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.ZooKeeper: Initiating client connection, connectString=hdp-nn1.hostname:2181,hdp-dn1.hostname:2181,hdp-nn2.hostname:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@20fc295f 2016-12-22 11:23:46,464 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-SendThread(hdp-nn2.hostname:2181)] zookeeper.ClientCnxn: Opening socket connection to server hdp-nn2.hostname/10.255.242.181:2181. Will not attempt to authenticate using SASL (unknown error) 2016-12-22 11:23:46,466 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-SendThread(hdp-nn2.hostname:2181)] zookeeper.ClientCnxn: Socket connection established to hdp-nn2.hostname/10.255.242.181:2181, initiating session 2016-12-22 11:23:46,469 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-SendThread(hdp-nn2.hostname:2181)] zookeeper.ClientCnxn: Session establishment complete on server hdp-nn2.hostname/10.255.242.181:2181, sessionid = 0x3570f2523bb3db5, negotiated timeout = 40000 2016-12-22 11:23:46,495 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x3570f2523bb3db5 2016-12-22 11:23:46,499 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1] zookeeper.ZooKeeper: Session: 0x3570f2523bb3db5 closed 2016-12-22 11:23:46,499 INFO [hdp-nn2.hostname,61300,1482394421267_ChoreService_1-EventThread] zookeeper.ClientCnxn: EventThread shut down 2016-12-22 11:28:41,485 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=89, evicted=0, evictedPerRun=0.0 2016-12-22 11:29:49,178 INFO [WALProcedureStoreSyncThread] wal.WALProcedureStore: Remove log: hdfs://prodcluster/ams/hbase/MasterProcWALs/state-00000000000000000001.log 2016-12-22 11:29:49,180 INFO [WALProcedureStoreSyncThread] wal.WALProcedureStore: Removed logs: [hdfs://prodcluster/ams/hbase/MasterProcWALs/state-00000000000000000002.log] 2016-12-22 11:33:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, 
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=119, evicted=0, evictedPerRun=0.0 2016-12-22 11:38:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=149, evicted=0, evictedPerRun=0.0 2016-12-22 11:43:41,485 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=179, evicted=0, evictedPerRun=0.0 2016-12-22 11:44:51,222 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times 2016-12-22 11:44:51,223 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics 2016-12-22 11:48:41,484 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=159.41 KB, freeSize=150.39 MB, max=150.54 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=209, evicted=0, evictedPerRun=0.0 2016-12-22 11:50:51,205 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times 2016-12-22 11:50:51,205 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://hdp-nn2.hostname:6188/ws/v1/timeline/metrics
I removed the folder /var/lib/ambari-metrics-collector/hbase-tmp and restarted AMS as recommended here: https://community.hortonworks.com/articles/11805/how-to-solve-ambari-metrics-corrupted-data.html, but it did not help.
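For reference, that cleanup amounted to roughly the following (a sketch; the hbase-tmp path is the java.io.tmpdir set in my ams-hbase hbase-env.sh):

# stop Ambari Metrics from the Ambari UI, then on the collector host:
rm -rf /var/lib/ambari-metrics-collector/hbase-tmp
# start Ambari Metrics again from the Ambari UI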
Created 12-22-2016 05:42 PM
Can you try the steps given in https://cwiki.apache.org/confluence/display/AMBARI/Cleaning+up+Ambari+Metrics+System+Data ?
Created 12-22-2016 06:14 PM
Can you verify if the Ambari and AMS versions are the same using 'rpm -qa | grep ambari'?
How many nodes do you have in your cluster?
Please share the contents of /etc/ambari-metrics-collector/conf/ams-env.sh and /etc/ams-hbase/conf/hbase-env.sh from the Metrics Collector host.
Created 12-23-2016 06:10 AM
I have 7 nodes (2 nn + 5 dn).
Here is the info:
rpm -qa | grep ambari
ambari-metrics-collector-2.4.0.1-1.x86_64
ambari-metrics-hadoop-sink-2.4.0.1-1.x86_64
ambari-agent-2.4.0.1-1.x86_64
ambari-infra-solr-client-2.4.0.1-1.x86_64
ambari-logsearch-logfeeder-2.4.0.1-1.x86_64
ambari-metrics-monitor-2.4.0.1-1.x86_64
ambari-metrics-grafana-2.4.0.1-1.x86_64
ambari-infra-solr-2.4.0.1-1.x86_64
cat /etc/ambari-metrics-collector/conf/ams-env.sh # Set environment variables here. # The java implementation to use. Java 1.6 required. export JAVA_HOME=/usr/jdk64/jdk1.8.0_77 # Collector Log directory for log4j export AMS_COLLECTOR_LOG_DIR=/var/log/ambari-metrics-collector # Monitor Log directory for outfile export AMS_MONITOR_LOG_DIR=/var/log/ambari-metrics-monitor # Collector pid directory export AMS_COLLECTOR_PID_DIR=/var/run/ambari-metrics-collector # Monitor pid directory export AMS_MONITOR_PID_DIR=/var/run/ambari-metrics-monitor # AMS HBase pid directory export AMS_HBASE_PID_DIR=/var/run/ambari-metrics-collector/ # AMS Collector heapsize export AMS_COLLECTOR_HEAPSIZE=1024m # HBase normalizer enabled export AMS_HBASE_NORMALIZER_ENABLED=False # HBase compaction policy enabled export AMS_HBASE_FIFO_COMPACTION_ENABLED=True # HBase Tables Initialization check enabled export AMS_HBASE_INIT_CHECK_ENABLED=True # AMS Collector options export AMS_COLLECTOR_OPTS="-Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native" # AMS Collector GC options export AMS_COLLECTOR_GC_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-`date +'%Y%m%d%H%M'`" export AMS_COLLECTOR_OPTS="$AMS_COLLECTOR_OPTS $AMS_COLLECTOR_GC_OPTS"
cat /etc/ams-hbase/conf/hbase-env.sh # Set environment variables here. # The java implementation to use. Java 1.6+ required. export JAVA_HOME=/usr/jdk64/jdk1.8.0_77 # HBase Configuration directory export HBASE_CONF_DIR=${HBASE_CONF_DIR:-/etc/ams-hbase/conf} # Extra Java CLASSPATH elements. Optional. additional_cp= if [ -n "$additional_cp" ]; then export HBASE_CLASSPATH=${HBASE_CLASSPATH}:$additional_cp else export HBASE_CLASSPATH=${HBASE_CLASSPATH} fi # The maximum amount of heap to use for hbase shell. export HBASE_SHELL_OPTS="-Xmx256m" # Extra Java runtime options. # Below are what we set by default. May only work with SUN JVM. # For more on why as well as other possible settings, # see http://wiki.apache.org/hadoop/PerformanceTuning export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/ambari-metrics-collector/hs_err_pid%p.log -Djava.io.tmpdir=/var/lib/ambari-metrics-collector/hbase-tmp" export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/gc.log-`date +'%Y%m%d%H%M'`" # Uncomment below to enable java garbage collection logging. # export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log" # Uncomment and adjust to enable JMX exporting # See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access. # More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html # # export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false" export HBASE_MASTER_OPTS=" -Xms512m -Xmx512m -Xmn102m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly" export HBASE_REGIONSERVER_OPTS=" -Xmn128m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms896m -Xmx896m" # export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103" # export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104" # File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default. export HBASE_REGIONSERVERS=${HBASE_CONF_DIR}/regionservers # Extra ssh options. Empty by default. # export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR" # Where log files are stored. $HBASE_HOME/logs by default. export HBASE_LOG_DIR=/var/log/ambari-metrics-collector # A string representing this instance of hbase. $USER by default. # export HBASE_IDENT_STRING=$USER # The scheduling priority for daemon processes. See 'man nice'. # export HBASE_NICENESS=10 # The directory where pid files are stored. /tmp by default. export HBASE_PID_DIR=/var/run/ambari-metrics-collector/ # Seconds to sleep between slave commands. Unset by default. This # can be useful in large clusters, where, e.g., slave rsyncs can # otherwise arrive faster than the master can service them. # export HBASE_SLAVE_SLEEP=0.1 # Tell HBase whether it should manage it's own instance of Zookeeper or not. export HBASE_MANAGES_ZK=false # use embedded native libs _HADOOP_NATIVE_LIB="/usr/lib/ams-hbase/lib/hadoop-native/" export HBASE_OPTS="$HBASE_OPTS -Djava.library.path=${_HADOOP_NATIVE_LIB}" # Unset HADOOP_HOME to avoid importing HADOOP installed cluster related configs like: /usr/hdp/2.2.0.0-2041/hadoop/conf/ export HADOOP_HOME=/usr/lib/ams-hbase/ # Explicitly Setting HBASE_HOME for AMS HBase so that there is no conflict export HBASE_HOME=/usr/lib/ams-hbase/
Created 12-23-2016 06:10 AM
If you are trying to clean up AMS data in distributed mode, you have to stop the Metrics Collector first. Then you have to clean up the HBase rootdir in HDFS and delete the AMS znode in the cluster's ZooKeeper service. You should be able to do that with the zkCli utility; the znode to delete is /ams-hbase-unsecure.
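Roughly, the steps would look like this (a sketch only; substitute your own ZooKeeper host, and the rootdir and znode names assume the values shown in your settings):

# 1. stop the Metrics Collector from Ambari, then:
hadoop fs -rm -r /ams/hbase/*                          # clean the AMS HBase rootdir in HDFS
# 2. delete the AMS znode from the cluster ZooKeeper
/usr/hdp/current/zookeeper-client/bin/zkCli.sh -server <zk-host>:2181
  rmr /ams-hbase-unsecure
# 3. start the Metrics Collector again from Ambari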
Created 12-23-2016 06:02 AM
I tried that, but without success.
Here is the ambari-metrics-collector.log:
2016-12-23 08:49:16,046 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping phoenix metrics system... 2016-12-23 08:49:16,047 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: phoenix metrics system stopped. 2016-12-23 08:49:16,048 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: phoenix metrics system shutdown complete. 2016-12-23 08:49:16,048 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl: Stopping ApplicationHistory 2016-12-23 08:49:16,048 FATAL org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: Error starting ApplicationHistoryServer org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.MetricsSystemInitializationException: Error creating Metrics Schema in HBase using Phoenix. at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.PhoenixHBaseAccessor.initMetricSchema(PhoenixHBaseAccessor.java:470) at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.HBaseTimelineMetricStore.initializeSubsystem(HBaseTimelineMetricStore.java:94) at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.HBaseTimelineMetricStore.serviceInit(HBaseTimelineMetricStore.java:86) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:84) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:137) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:147) Caused by: org.apache.phoenix.exception.PhoenixIOException: SYSTEM.CATALOG at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:111) at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1292) at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1257) at org.apache.phoenix.query.ConnectionQueryServicesImpl.createTable(ConnectionQueryServicesImpl.java:1453) at org.apache.phoenix.schema.MetaDataClient.createTableInternal(MetaDataClient.java:2180) at org.apache.phoenix.schema.MetaDataClient.createTable(MetaDataClient.java:865) at org.apache.phoenix.compile.CreateTableCompiler$2.execute(CreateTableCompiler.java:194) at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:343) at org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixStatement.java:331) at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53) at org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(PhoenixStatement.java:329) at org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(PhoenixStatement.java:1421) at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2378) at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2327) at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:78) at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2327) at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:233) at 
org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:142) at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.query.DefaultPhoenixDataSource.getConnection(DefaultPhoenixDataSource.java:82) at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.PhoenixHBaseAccessor.getConnection(PhoenixHBaseAccessor.java:376) at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.PhoenixHBaseAccessor.getConnectionRetryingOnException(PhoenixHBaseAccessor.java:354) at org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.PhoenixHBaseAccessor.initMetricSchema(PhoenixHBaseAccessor.java:398) ... 8 more Caused by: org.apache.hadoop.hbase.TableNotFoundException: SYSTEM.CATALOG at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1264) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1146) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1103) at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:938) at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83) at org.apache.hadoop.hbase.client.HTable.getRegionLocation(HTable.java:504) at org.apache.hadoop.hbase.client.HTable.getKeysAndRegionsInRange(HTable.java:720) at org.apache.hadoop.hbase.client.HTable.getKeysAndRegionsInRange(HTable.java:690) at org.apache.hadoop.hbase.client.HTable.getStartKeysInRange(HTable.java:1757) at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1712) at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1692) at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1275) ... 31 more 2016-12-23 08:49:16,052 INFO org.apache.hadoop.util.ExitUtil: Exiting with status -1 2016-12-23 08:49:16,069 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down ApplicationHistoryServer at hdp-nn2.hostname/10.255.242.181 ************************************************************/ 2016-12-23 08:49:16,115 WARN org.apache.hadoop.hbase.io.util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
Created 12-23-2016 08:50 AM
I did this:
1) Turn on Maintenance mode
2) Stop Ambari Metrics
3) hadoop fs -rmr /ams/hbase/*
4) rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/*
5) [zk: localhost:2181(CONNECTED) 0] ls /
   [registry, controller, brokers, storm, zookeeper, infra-solr, hiveserver2-hive2, hbase-unsecure, yarn-leader-election, tracers, hadoop-ha, admin, isr_change_notification, services, templeton-hadoop, accumulo, controller_epoch, hiveserver2, llap-unsecure, rmstore, ranger_audits, consumers, config, ams-hbase-unsecure]
   [zk: localhost:2181(CONNECTED) 1] rmr /ams-hbase-unsecure
   [zk: localhost:2181(CONNECTED) 2] ls /
   [registry, controller, brokers, storm, zookeeper, infra-solr, hiveserver2-hive2, hbase-unsecure, yarn-leader-election, tracers, hadoop-ha, admin, isr_change_notification, services, templeton-hadoop, accumulo, controller_epoch, hiveserver2, llap-unsecure, rmstore, ranger_audits, consumers, config]
6) Start Ambari Metrics
7) Turn off Maintenance mode
After about 15 minutes I got this log:
2016-12-23 11:35:08,673 ERROR org.mortbay.log: /ws/v1/timeline/metrics/ javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException] at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:159) at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:895) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:843) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:804) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1294) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException] at 
com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:325) at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:249) at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:95) at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:179) at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157) ... 37 more Caused by: org.mortbay.jetty.EofException at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:634) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at com.sun.jersey.spi.container.servlet.WebComponent$Writer.write(WebComponent.java:307) at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.write(ContainerResponse.java:134) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.flushBuffer(UTF8XmlOutput.java:416) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.endDocument(UTF8XmlOutput.java:141) at com.sun.xml.bind.v2.runtime.XMLSerializer.endDocument(XMLSerializer.java:856) at com.sun.xml.bind.v2.runtime.MarshallerImpl.postwrite(MarshallerImpl.java:374) at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:321) ... 41 more 2016-12-23 11:35:09,796 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.PhoenixHBaseAccessor: Saved 8606 metadata records. 2016-12-23 11:35:09,843 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.PhoenixHBaseAccessor: Saved 7 hosted apps metadata records. 2016-12-23 11:35:25,123 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Fri Dec 23 11:35:25 MSK 2016 2016-12-23 11:35:25,124 INFO TimelineClusterAggregatorMinute: Last Checkpoint read : Fri Dec 23 11:30:00 MSK 2016 2016-12-23 11:35:25,124 INFO TimelineClusterAggregatorMinute: Rounded off checkpoint : Fri Dec 23 11:30:00 MSK 2016 2016-12-23 11:35:25,124 INFO TimelineClusterAggregatorMinute: Last check point time: 1482481800000, lagBy: 325 seconds. 2016-12-23 11:35:25,124 INFO TimelineClusterAggregatorMinute: Start aggregation cycle @ Fri Dec 23 11:35:25 MSK 2016, startTime = Fri Dec 23 11:30:00 MSK 2016, endTime = Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:35:25,143 INFO TimelineClusterAggregatorMinute: 0 row(s) updated. 2016-12-23 11:35:25,143 INFO TimelineClusterAggregatorMinute: Aggregated cluster metrics for METRIC_AGGREGATE_MINUTE, with startTime = Fri Dec 23 11:30:00 MSK 2016, endTime = Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:35:25,143 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:35:25 MSK 2016 2016-12-23 11:35:25,143 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:35:25 MSK 2016 2016-12-23 11:35:25,152 INFO TimelineMetricHostAggregatorMinute: Started Timeline aggregator thread @ Fri Dec 23 11:35:25 MSK 2016 2016-12-23 11:35:25,153 INFO TimelineMetricHostAggregatorMinute: Last Checkpoint read : Fri Dec 23 11:30:00 MSK 2016 2016-12-23 11:35:25,153 INFO TimelineMetricHostAggregatorMinute: Rounded off checkpoint : Fri Dec 23 11:30:00 MSK 2016 2016-12-23 11:35:25,153 INFO TimelineMetricHostAggregatorMinute: Last check point time: 1482481800000, lagBy: 325 seconds. 
2016-12-23 11:35:25,153 INFO TimelineMetricHostAggregatorMinute: Start aggregation cycle @ Fri Dec 23 11:35:25 MSK 2016, startTime = Fri Dec 23 11:30:00 MSK 2016, endTime = Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:35:25,907 INFO TimelineMetricHostAggregatorMinute: 0 row(s) updated. 2016-12-23 11:35:25,907 INFO TimelineMetricHostAggregatorMinute: Aggregated host metrics for METRIC_RECORD_MINUTE, with startTime = Fri Dec 23 11:30:00 MSK 2016, endTime = Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:35:25,907 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:35:25 MSK 2016 2016-12-23 11:35:25,907 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:35:25 MSK 2016 2016-12-23 11:40:26,448 INFO TimelineMetricHostAggregatorMinute: Started Timeline aggregator thread @ Fri Dec 23 11:40:26 MSK 2016 2016-12-23 11:40:26,448 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Fri Dec 23 11:40:26 MSK 2016 2016-12-23 11:40:26,449 INFO TimelineMetricHostAggregatorMinute: Last Checkpoint read : Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:40:26,449 INFO TimelineMetricHostAggregatorMinute: Rounded off checkpoint : Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:40:26,449 INFO TimelineMetricHostAggregatorMinute: Last check point time: 1482482100000, lagBy: 326 seconds. 2016-12-23 11:40:26,449 INFO TimelineClusterAggregatorMinute: Last Checkpoint read : Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:40:26,449 INFO TimelineMetricHostAggregatorMinute: Start aggregation cycle @ Fri Dec 23 11:40:26 MSK 2016, startTime = Fri Dec 23 11:35:00 MSK 2016, endTime = Fri Dec 23 11:40:00 MSK 2016 2016-12-23 11:40:26,450 INFO TimelineClusterAggregatorMinute: Rounded off checkpoint : Fri Dec 23 11:35:00 MSK 2016 2016-12-23 11:40:26,450 INFO TimelineClusterAggregatorMinute: Last check point time: 1482482100000, lagBy: 326 seconds. 2016-12-23 11:40:26,450 INFO TimelineClusterAggregatorMinute: Start aggregation cycle @ Fri Dec 23 11:40:26 MSK 2016, startTime = Fri Dec 23 11:35:00 MSK 2016, endTime = Fri Dec 23 11:40:00 MSK 2016 2016-12-23 11:40:26,464 INFO TimelineClusterAggregatorMinute: 0 row(s) updated. 2016-12-23 11:40:26,464 INFO TimelineClusterAggregatorMinute: Aggregated cluster metrics for METRIC_AGGREGATE_MINUTE, with startTime = Fri Dec 23 11:35:00 MSK 2016, endTime = Fri Dec 23 11:40:00 MSK 2016 2016-12-23 11:40:26,465 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:40:26 MSK 2016 2016-12-23 11:40:26,465 INFO TimelineClusterAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:40:26 MSK 2016 2016-12-23 11:40:46,839 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 25694 actions to finish 2016-12-23 11:40:46,899 INFO TimelineMetricHostAggregatorMinute: 22847 row(s) updated. 2016-12-23 11:40:46,899 INFO TimelineMetricHostAggregatorMinute: Aggregated host metrics for METRIC_RECORD_MINUTE, with startTime = Fri Dec 23 11:35:00 MSK 2016, endTime = Fri Dec 23 11:40:00 MSK 2016 2016-12-23 11:40:46,899 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:40:46 MSK 2016 2016-12-23 11:40:46,899 INFO TimelineMetricHostAggregatorMinute: End aggregation cycle @ Fri Dec 23 11:40:46 MSK 2016 2016-12-23 11:41:40,503 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 8014 actions to finish
What should I do next?
Created 12-25-2016 07:03 PM
Based on the following log statement,
2016-12-23 11:40:46,839 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 25694 actions to finish
I think AMS is receiving more write load than it can handle. That could be either because it is a large cluster and the memory settings are not tuned for it, or because one or more components are sending an unusually large number of metrics.
Can you share the following?
1. Number of nodes in the cluster
2. Responses to the following GET calls:
http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics/metadata
http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics/hosts
3. The following config files from the Metrics Collector host:
/etc/ambari-metrics-collector/conf/: ams-env.sh & ams-site.xml
/etc/ams-hbase/conf/: hbase-site.xml & hbase-env.sh
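In particular, the metadata endpoint gives a quick sense of how much the collector is tracking; for example (a rough sketch, run from any host that can reach the collector):

# count the distinct metric series the collector knows about
curl -s "http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics/metadata" | grep -o '"metricname"' | wc -l
# list which apps report metrics from which hosts
curl -s "http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics/hosts"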
Created 12-26-2016 05:49 AM
I have 7 nodes (2 nn + 5 dn).
Responses to the GET calls:
http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics/metadata (it's too long, I cut it) {"type":"COUNTER","seriesStartTime":1482480880891,"metricname":"regionserver.WAL.rollRequest","supportsAggregation":true},{"type":"COUNTER","seriesStartTime":1482480880869,"metricname":"jvm.Master.JvmMetrics.GcCount","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880889,"metricname":"master.FileSystem.MetaHlogSplitSize_99th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880889,"metricname":"master.FileSystem.HlogSplitSize_98th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880874,"metricname":"master.Master.QueueCallTime_median","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880889,"metricname":"master.FileSystem.HlogSplitSize_max","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.Balancer.BalancerCluster_median","supportsAggregation":true},{"type":"COUNTER","seriesStartTime":1482480880896,"metricname":"metricssystem.MetricsSystem.PublishNumOps","supportsAggregation":true},{"type":"COUNTER","seriesStartTime":1482480880874,"metricname":"master.Master.exceptions.FailedSanityCheckException","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880869,"metricname":"jvm.Master.JvmMetrics.MemHeapUsedM","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.AssignmentManger.Assign_mean","supportsAggregation":true},{"type":"COUNTER","seriesStartTime":1482480880896,"metricname":"metricssystem.MetricsSystem.Sink_timelineDropped","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880889,"metricname":"master.FileSystem.MetaHlogSplitTime_95th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.AssignmentManger.BulkAssign_95th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880869,"metricname":"jvm.Master.JvmMetrics.MemNonHeapUsedM","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.AssignmentManger.Assign_99.9th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880874,"metricname":"master.Master.RequestSize_mean","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880874,"metricname":"master.Master.RequestSize_min","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.AssignmentManger.Assign_99th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880891,"metricname":"regionserver.WAL.AppendSize_99th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.Balancer.BalancerCluster_99.9th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.AssignmentManger.BulkAssign_75th_percentile","supportsAggregation":true},{"type":"COUNTER","seriesStartTime":1482480880889,"metricname":"master.FileSystem.MetaHlogSplitTime_num_ops","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880891,"metricname":"regionserver.WAL.SyncTime_90th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880894,"metricname":"master.Balancer.BalancerCluster_90th_percentile","supportsAggregation":true},{"type":"GAUGE","seriesStartTime":1482480880891,"metricname":"regionserver.WAL.AppendTime_max","supportsAggregati
on":true}],"logfeeder":[{"type":"Long","seriesStartTime":1482480913546,"metricname":"output.solr.write_logs","supportsAggregation":true},{"type":"Long","seriesStartTime":1482480913546,"metricname":"input.files.count","supportsAggregation":true},{"type":"Long","seriesStartTime":1482480943578,"metricname":"filter.error.keyvalue","supportsAggregation":true},{"type":"Long","seriesStartTime":1482480913546,"metricname":"filter.error.grok","supportsAggregation":true},{"type":"Long","seriesStartTime":1482480913546,"metricname":"input.files.read_bytes","supportsAggregation":true},{"type":"Long","seriesStartTime":1482480913546,"metricname":"output.solr.write_bytes","supportsAggregation":true},{"type":"Long","seriesStartTime":1482480913546,"metricname":"input.files.read_lines","supportsAggregation":true}]}
http://<METRICS_COLLECTOR_HOST>:6188/ws/v1/timeline/metrics/hosts {"hdp-dn3.hostname":["accumulo","datanode","journalnode","HOST","nodemanager","hbase","logfeeder"],"hdp-dn5.hostname":["accumulo","datanode","HOST","nodemanager","hbase","logfeeder"],"hdp-dn2.hostname":["accumulo","datanode","HOST","nodemanager","logfeeder"],"hdp-nn1.hostname":["accumulo","nimbus","resourcemanager","journalnode","HOST","applicationhistoryserver","namenode","hbase","kafka_broker","logfeeder"],"hdp-dn1.hostname":["accumulo","hiveserver2","datanode","hivemetastore","HOST","nodemanager","logfeeder"],"hdp-dn4.hostname":["accumulo","datanode","HOST","nodemanager","hbase","logfeeder"],"hdp-nn2.hostname":["hiveserver2","hivemetastore","journalnode","resourcemanager","HOST","jobhistoryserver","namenode","ams-hbase","logfeeder"]}
Config files
cat /etc/ambari-metrics-collector/conf/ams-env.sh # Set environment variables here. # The java implementation to use. Java 1.6 required. export JAVA_HOME=/usr/jdk64/jdk1.8.0_77 # Collector Log directory for log4j export AMS_COLLECTOR_LOG_DIR=/var/log/ambari-metrics-collector # Monitor Log directory for outfile export AMS_MONITOR_LOG_DIR=/var/log/ambari-metrics-monitor # Collector pid directory export AMS_COLLECTOR_PID_DIR=/var/run/ambari-metrics-collector # Monitor pid directory export AMS_MONITOR_PID_DIR=/var/run/ambari-metrics-monitor # AMS HBase pid directory export AMS_HBASE_PID_DIR=/var/run/ambari-metrics-collector/ # AMS Collector heapsize export AMS_COLLECTOR_HEAPSIZE=1024m # HBase normalizer enabled export AMS_HBASE_NORMALIZER_ENABLED=False # HBase compaction policy enabled export AMS_HBASE_FIFO_COMPACTION_ENABLED=True # HBase Tables Initialization check enabled export AMS_HBASE_INIT_CHECK_ENABLED=True # AMS Collector options export AMS_COLLECTOR_OPTS="-Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native" # AMS Collector GC options export AMS_COLLECTOR_GC_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-`date +'%Y%m%d%H%M'`" export AMS_COLLECTOR_OPTS="$AMS_COLLECTOR_OPTS $AMS_COLLECTOR_GC_OPTS"
cat /etc/ambari-metrics-collector/conf/ams-site.xml <configuration> <property> <name>phoenix.query.maxGlobalMemoryPercentage</name> <value>25</value> </property> <property> <name>phoenix.spool.directory</name> <value>/tmp</value> </property> <property> <name>timeline.metrics.aggregator.checkpoint.dir</name> <value>/var/lib/ambari-metrics-collector/checkpoint</value> </property> <property> <name>timeline.metrics.aggregators.skip.blockcache.enabled</name> <value>false</value> </property> <property> <name>timeline.metrics.cache.commit.interval</name> <value>3</value> </property> <property> <name>timeline.metrics.cache.enabled</name> <value>true</value> </property> <property> <name>timeline.metrics.cache.size</name> <value>150</value> </property> <property> <name>timeline.metrics.cluster.aggregate.splitpoints</name> <value>kafka.server.BrokerTopicMetrics.FailedFetchRequestsPerSec.meanRate</value> </property> <property> <name>timeline.metrics.cluster.aggregator.daily.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.cluster.aggregator.daily.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.cluster.aggregator.daily.interval</name> <value>86400</value> </property> <property> <name>timeline.metrics.cluster.aggregator.daily.ttl</name> <value>63072000</value> </property> <property> <name>timeline.metrics.cluster.aggregator.hourly.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.cluster.aggregator.hourly.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.cluster.aggregator.hourly.interval</name> <value>3600</value> </property> <property> <name>timeline.metrics.cluster.aggregator.hourly.ttl</name> <value>31536000</value> </property> <property> <name>timeline.metrics.cluster.aggregator.interpolation.enabled</name> <value>true</value> </property> <property> <name>timeline.metrics.cluster.aggregator.minute.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.cluster.aggregator.minute.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.cluster.aggregator.minute.interval</name> <value>300</value> </property> <property> <name>timeline.metrics.cluster.aggregator.minute.ttl</name> <value>2592000</value> </property> <property> <name>timeline.metrics.cluster.aggregator.second.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.cluster.aggregator.second.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.cluster.aggregator.second.interval</name> <value>120</value> </property> <property> <name>timeline.metrics.cluster.aggregator.second.timeslice.interval</name> <value>30</value> </property> <property> <name>timeline.metrics.cluster.aggregator.second.ttl</name> <value>259200</value> </property> <property> <name>timeline.metrics.daily.aggregator.minute.interval</name> <value>86400</value> </property> <property> <name>timeline.metrics.hbase.compression.scheme</name> <value>SNAPPY</value> </property> <property> <name>timeline.metrics.hbase.data.block.encoding</name> <value>FAST_DIFF</value> </property> <property> <name>timeline.metrics.hbase.fifo.compaction.enabled</name> <value>true</value> </property> <property> <name>timeline.metrics.hbase.init.check.enabled</name> <value>true</value> </property> <property> <name>timeline.metrics.host.aggregate.splitpoints</name> 
<value>kafka.server.BrokerTopicMetrics.FailedFetchRequestsPerSec.meanRate</value> </property> <property> <name>timeline.metrics.host.aggregator.daily.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.host.aggregator.daily.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.host.aggregator.daily.ttl</name> <value>31536000</value> </property> <property> <name>timeline.metrics.host.aggregator.hourly.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.host.aggregator.hourly.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.host.aggregator.hourly.interval</name> <value>3600</value> </property> <property> <name>timeline.metrics.host.aggregator.hourly.ttl</name> <value>2592000</value> </property> <property> <name>timeline.metrics.host.aggregator.minute.checkpointCutOffMultiplier</name> <value>2</value> </property> <property> <name>timeline.metrics.host.aggregator.minute.disabled</name> <value>false</value> </property> <property> <name>timeline.metrics.host.aggregator.minute.interval</name> <value>300</value> </property> <property> <name>timeline.metrics.host.aggregator.minute.ttl</name> <value>604800</value> </property> <property> <name>timeline.metrics.host.aggregator.ttl</name> <value>86400</value> </property> <property> <name>timeline.metrics.service.checkpointDelay</name> <value>60</value> </property> <property> <name>timeline.metrics.service.cluster.aggregator.appIds</name> <value>datanode,nodemanager,hbase</value> </property> <property> <name>timeline.metrics.service.default.result.limit</name> <value>15840</value> </property> <property> <name>timeline.metrics.service.handler.thread.count</name> <value>20</value> </property> <property> <name>timeline.metrics.service.http.policy</name> <value>HTTP_ONLY</value> </property> <property> <name>timeline.metrics.service.operation.mode</name> <value>distributed</value> </property> <property> <name>timeline.metrics.service.resultset.fetchSize</name> <value>2000</value> </property> <property> <name>timeline.metrics.service.rpc.address</name> <value>0.0.0.0:60200</value> </property> <property> <name>timeline.metrics.service.use.groupBy.aggregators</name> <value>true</value> </property> <property> <name>timeline.metrics.service.watcher.delay</name> <value>30</value> </property> <property> <name>timeline.metrics.service.watcher.disabled</name> <value>true</value> </property> <property> <name>timeline.metrics.service.watcher.initial.delay</name> <value>600</value> </property> <property> <name>timeline.metrics.service.watcher.timeout</name> <value>30</value> </property> <property> <name>timeline.metrics.service.webapp.address</name> <value>hdp-nn2.hostname:6188</value> </property> <property> <name>timeline.metrics.sink.collection.period</name> <value>10</value> </property> <property> <name>timeline.metrics.sink.report.interval</name> <value>60</value> </property> </configuration>
cat /etc/ams-hbase/conf/hbase-site.xml <configuration> <property> <name>dfs.block.access.token.enable</name> <value>true</value> </property> <property> <name>dfs.blockreport.initialDelay</name> <value>120</value> </property> <property> <name>dfs.blocksize</name> <value>134217728</value> </property> <property> <name>dfs.client.failover.proxy.provider.prodcluster</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.client.read.shortcircuit</name> <value>true</value> </property> <property> <name>dfs.client.read.shortcircuit.streams.cache.size</name> <value>4096</value> </property> <property> <name>dfs.client.retry.policy.enabled</name> <value>false</value> </property> <property> <name>dfs.cluster.administrators</name> <value> hdfs</value> </property> <property> <name>dfs.content-summary.limit</name> <value>5000</value> </property> <property> <name>dfs.datanode.address</name> <value>0.0.0.0:50010</value> </property> <property> <name>dfs.datanode.balance.bandwidthPerSec</name> <value>6250000</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>/hdfs/hadoop/hdfs/data</value> <final>true</final> </property> <property> <name>dfs.datanode.data.dir.perm</name> <value>750</value> </property> <property> <name>dfs.datanode.du.reserved</name> <value>65906998272</value> </property> <property> <name>dfs.datanode.failed.volumes.tolerated</name> <value>0</value> <final>true</final> </property> <property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:50075</value> </property> <property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:50475</value> </property> <property> <name>dfs.datanode.ipc.address</name> <value>0.0.0.0:8010</value> </property> <property> <name>dfs.datanode.max.transfer.threads</name> <value>16384</value> </property> <property> <name>dfs.domain.socket.path</name> <value>/var/lib/hadoop-hdfs/dn_socket</value> </property> <property> <name>dfs.encrypt.data.transfer.cipher.suites</name> <value>AES/CTR/NoPadding</value> </property> <property> <name>dfs.ha.automatic-failover.enabled</name> <value>true</value> </property> <property> <name>dfs.ha.fencing.methods</name> <value>shell(/bin/true)</value> </property> <property> <name>dfs.ha.namenodes.prodcluster</name> <value>nn1,nn2</value> </property> <property> <name>dfs.heartbeat.interval</name> <value>3</value> </property> <property> <name>dfs.hosts.exclude</name> <value>/etc/hadoop/conf/dfs.exclude</value> </property> <property> <name>dfs.http.policy</name> <value>HTTP_ONLY</value> </property> <property> <name>dfs.https.port</name> <value>50470</value> </property> <property> <name>dfs.internal.nameservices</name> <value>prodcluster</value> </property> <property> <name>dfs.journalnode.edits.dir</name> <value>/hadoop/hdfs/journal</value> </property> <property> <name>dfs.journalnode.http-address</name> <value>0.0.0.0:8480</value> </property> <property> <name>dfs.journalnode.https-address</name> <value>0.0.0.0:8481</value> </property> <property> <name>dfs.namenode.accesstime.precision</name> <value>0</value> </property> <property> <name>dfs.namenode.audit.log.async</name> <value>true</value> </property> <property> <name>dfs.namenode.avoid.read.stale.datanode</name> <value>true</value> </property> <property> <name>dfs.namenode.avoid.write.stale.datanode</name> <value>true</value> </property> <property> <name>dfs.namenode.checkpoint.dir</name> <value>/hdfs/hadoop/hdfs/namesecondary</value> </property> <property> 
<name>dfs.namenode.checkpoint.edits.dir</name> <value>${dfs.namenode.checkpoint.dir}</value> </property> <property> <name>dfs.namenode.checkpoint.period</name> <value>21600</value> </property> <property> <name>dfs.namenode.checkpoint.txns</name> <value>1000000</value> </property> <property> <name>dfs.namenode.fslock.fair</name> <value>false</value> </property> <property> <name>dfs.namenode.handler.count</name> <value>600</value> </property> <property> <name>dfs.namenode.http-address.prodcluster.nn1</name> <value>hdp-nn1.hostname:50070</value> </property> <property> <name>dfs.namenode.http-address.prodcluster.nn2</name> <value>hdp-nn2.hostname:50070</value> </property> <property> <name>dfs.namenode.https-address.prodcluster.nn1</name> <value>hdp-nn1.hostname:50470</value> </property> <property> <name>dfs.namenode.https-address.prodcluster.nn2</name> <value>hdp-nn2.hostname:50470</value> </property> <property> <name>dfs.namenode.inode.attributes.provider.class</name> <value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>/hdfs/hadoop/hdfs/namenode</value> <final>true</final> </property> <property> <name>dfs.namenode.name.dir.restore</name> <value>true</value> </property> <property> <name>dfs.namenode.rpc-address.prodcluster.nn1</name> <value>hdp-nn1.hostname:8020</value> </property> <property> <name>dfs.namenode.rpc-address.prodcluster.nn2</name> <value>hdp-nn2.hostname:8020</value> </property> <property> <name>dfs.namenode.safemode.threshold-pct</name> <value>0.99</value> </property> <property> <name>dfs.namenode.shared.edits.dir</name> <value>qjournal://hdp-dn3.hostname:8485;hdp-nn1.hostname:8485;hdp-nn2.hostname:8485/prodcluster</value> </property> <property> <name>dfs.namenode.stale.datanode.interval</name> <value>30000</value> </property> <property> <name>dfs.namenode.startup.delay.block.deletion.sec</name> <value>3600</value> </property> <property> <name>dfs.namenode.write.stale.datanode.ratio</name> <value>1.0f</value> </property> <property> <name>dfs.nameservices</name> <value>prodcluster</value> </property> <property> <name>dfs.permissions.enabled</name> <value>true</value> </property> <property> <name>dfs.permissions.superusergroup</name> <value>hdfs</value> </property> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.replication.max</name> <value>50</value> </property> <property> <name>dfs.support.append</name> <value>true</value> <final>true</final> </property> <property> <name>dfs.webhdfs.enabled</name> <value>true</value> <final>true</final> </property> <property> <name>fs.permissions.umask-mode</name> <value>077</value> </property> <property> <name>nfs.exports.allowed.hosts</name> <value>* rw</value> </property> <property> <name>nfs.file.dump.dir</name> <value>/tmp/.hdfs-nfs</value> </property> </configuration>
cat /etc/ams-hbase/conf/hbase-env.sh # Set environment variables here. # The java implementation to use. Java 1.6+ required. export JAVA_HOME=/usr/jdk64/jdk1.8.0_77 # HBase Configuration directory export HBASE_CONF_DIR=${HBASE_CONF_DIR:-/etc/ams-hbase/conf} # Extra Java CLASSPATH elements. Optional. additional_cp= if [ -n "$additional_cp" ]; then export HBASE_CLASSPATH=${HBASE_CLASSPATH}:$additional_cp else export HBASE_CLASSPATH=${HBASE_CLASSPATH} fi # The maximum amount of heap to use for hbase shell. export HBASE_SHELL_OPTS="-Xmx256m" # Extra Java runtime options. # Below are what we set by default. May only work with SUN JVM. # For more on why as well as other possible settings, # see http://wiki.apache.org/hadoop/PerformanceTuning export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/ambari-metrics-collector/hs_err_pid%p.log -Djava.io.tmpdir=/var/lib/ambari-metrics-collector/hbase-tmp" export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/gc.log-`date +'%Y%m%d%H%M'`" # Uncomment below to enable java garbage collection logging. # export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log" # Uncomment and adjust to enable JMX exporting # See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access. # More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html # # export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false" export HBASE_MASTER_OPTS=" -Xms512m -Xmx512m -Xmn102m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly" export HBASE_REGIONSERVER_OPTS=" -Xmn128m -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms896m -Xmx896m" # export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103" # export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104" # File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default. export HBASE_REGIONSERVERS=${HBASE_CONF_DIR}/regionservers # Extra ssh options. Empty by default. # export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR" # Where log files are stored. $HBASE_HOME/logs by default. export HBASE_LOG_DIR=/var/log/ambari-metrics-collector # A string representing this instance of hbase. $USER by default. # export HBASE_IDENT_STRING=$USER # The scheduling priority for daemon processes. See 'man nice'. # export HBASE_NICENESS=10 # The directory where pid files are stored. /tmp by default. export HBASE_PID_DIR=/var/run/ambari-metrics-collector/ # Seconds to sleep between slave commands. Unset by default. This # can be useful in large clusters, where, e.g., slave rsyncs can # otherwise arrive faster than the master can service them. # export HBASE_SLAVE_SLEEP=0.1 # Tell HBase whether it should manage it's own instance of Zookeeper or not. export HBASE_MANAGES_ZK=false # use embedded native libs _HADOOP_NATIVE_LIB="/usr/lib/ams-hbase/lib/hadoop-native/" export HBASE_OPTS="$HBASE_OPTS -Djava.library.path=${_HADOOP_NATIVE_LIB}" # Unset HADOOP_HOME to avoid importing HADOOP installed cluster related configs like: /usr/hdp/2.2.0.0-2041/hadoop/conf/ export HADOOP_HOME=/usr/lib/ams-hbase/ # Explicitly Setting HBASE_HOME for AMS HBase so that there is no conflict export HBASE_HOME=/usr/lib/ams-hbase/
rpm -qa | grep ambari
ambari-metrics-collector-2.4.0.1-1.x86_64
ambari-metrics-hadoop-sink-2.4.0.1-1.x86_64
ambari-agent-2.4.0.1-1.x86_64
ambari-infra-solr-client-2.4.0.1-1.x86_64
ambari-logsearch-logfeeder-2.4.0.1-1.x86_64
ambari-metrics-monitor-2.4.0.1-1.x86_64
ambari-metrics-grafana-2.4.0.1-1.x86_64
ambari-infra-solr-2.4.0.1-1.x86_64
Created 12-27-2016 05:00 AM
Yesterday Ambari Metrics suddenly started working (and is still working). The only thing I changed yesterday was installing Apache Atlas, which required restarting almost all components; maybe that helped. Thanks for your assistance!