We're experiencing periodic crashing of our Ambari Metrics Collector.
This only started happening after upgrading from HDP 2.2.8 to 2.3.2.
There are some warnings in the logs that look interesting: "TimelineClusterAggregatorSecond:131 - Last Checkpoint is too old, discarding last checkpoint"
Can anyone provide some guidance on how we might better tune our Metrics Collector service to prevent these crashes?
10:30:06,487 WARN [pool-7-thread-1] TimelineClusterAggregatorSecond:131 - Last Checkpoint is too old, discarding last checkpoint. lastCheckPointTime = Fri May 27 10:24:00 EDT 2016
10:30:06,487 INFO [pool-7-thread-1] TimelineClusterAggregatorSecond:134 - Saving checkpoint time. Fri May 27 10:28:00 EDT 2016
10:30:06,487 INFO [Thread-4] ZooKeeper:684 - Session: 0x254c9555ca7f744 closed
10:30:06,487 INFO [main-EventThread] ClientCnxn:524 - EventThread shut down
10:30:06,488 INFO [pool-7-thread-1] TimelineClusterAggregatorSecond:106 - Last check point time: 1464359280000, lagBy: 126 seconds.
10:30:06,488 INFO [pool-7-thread-1] TimelineClusterAggregatorSecond:211 - Start aggregation cycle @ Fri May 27 10:30:06 EDT 2016, startTime = Fri May 27 10:28:00 EDT 2016, endTime = Fri May 27 10:30:00 EDT 2016
10:31:56,528 INFO [main] ApplicationHistoryServer:45 - STARTUP_MSG:
Zack, what version of Ambari is this?
Do you see a SIGTERM (signal 15) every time before the AMS crashes and restarts?
If yes, please take a look at my recommendation here - https://community.hortonworks.com/questions/23232/ambari-metrics-randomly-restart.html#answer-26700
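One way to check for that pattern is to scan the collector log rather than reading it by eye. The following is only a minimal sketch: the `summarize` helper is hypothetical, and it assumes that each restart shows up as a logger line ending in "- STARTUP_MSG:" (as in the excerpt above), that a termination signal is logged on a line containing the text "SIGTERM", and that the aggregator reports its lag as "lagBy: N seconds". Adjust the patterns if your log looks different.

```python
import re
import sys

# Patterns are assumptions based on the log excerpt in this thread.
LAG_RE = re.compile(r"lagBy: (\d+) seconds")
STARTUP_RE = re.compile(r" - STARTUP_MSG:")


def summarize(lines):
    """Collect aggregator lag values and count restarts preceded by a SIGTERM line."""
    lags = []
    restarts = 0
    restarts_after_sigterm = 0
    saw_sigterm = False  # True if a SIGTERM line appeared since the last startup
    for line in lines:
        match = LAG_RE.search(line)
        if match:
            lags.append(int(match.group(1)))
        if "SIGTERM" in line:
            saw_sigterm = True
        if STARTUP_RE.search(line):
            restarts += 1
            if saw_sigterm:
                restarts_after_sigterm += 1
            saw_sigterm = False
    return {
        "lags": lags,
        "restarts": restarts,
        "restarts_after_sigterm": restarts_after_sigterm,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ams_log_check.py <path to the collector log>
    with open(sys.argv[1], errors="replace") as handle:
        print(summarize(handle))
```

If every restart is counted in `restarts_after_sigterm`, something external is stopping the collector; if none are, the crash is more likely coming from inside the process.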