Created on 06-30-2017 11:09 PM
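I didn't want to clear the ATS (YARN Application Timeline Server) data and lose all the job history. I wanted to fix the corruption and preserve the ATS leveldb entries. This is what the ApplicationHistoryServer log showed when the ATS failed to start: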
2017-06-15 19:17:43,871 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 43 missing files; e.g.: /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/000015.sst
<snipped>
2017-06-15 19:17:43,871 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping ApplicationHistoryServer metrics system...
2017-06-15 19:17:43,873 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted.
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - ApplicationHistoryServer metrics system stopped.
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(605)) - ApplicationHistoryServer metrics system shutdown complete.
2017-06-15 19:17:43,876 FATAL applicationhistoryservice.ApplicationHistoryServer (ApplicationHistoryServer.java:launchAppHistoryServer(171)) - Error starting ApplicationHistoryServer
<snipped>
2017-06-15 19:17:43,877 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status -1
2017-06-15 19:17:43,880 INFO applicationhistoryservice.ApplicationHistoryServer (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer
This post on HCC was helpful: ATS issue.
What wasn't obvious from that post is that there could be more than one leveldb "partition" (my term) corrupted.
In my case, several leveldb directories were corrupted, and the fix was to remove the CURRENT file from each of them. I had to remove each of the following CURRENT files:
/data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/CURRENT
/data/hadoop/ats/leveldb/leveldb-timeline-store/starttime-ldb/CURRENT
/data/hadoop/ats/leveldb/leveldb-timeline-store/owner-ldb/CURRENT
/data/hadoop/yarn/timeline/timeline-state-store.ldb/CURRENT
I kept copies of the CURRENT files in /tmp/leveldbissue like this:
cd <dir where the leveldb files were reporting missing>
mkdir -p /tmp/leveldbissue
cp -ip CURRENT /tmp/leveldbissue/CURRENT.xxxx-ldb
rm CURRENT
(where xxxx-ldb is the name of the deepest dir where the leveldb files were reporting missing, e.g. domain-ldb)
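The backup-and-remove step can be wrapped in a small function so it is done the same way for each directory. This is only a sketch: backup_and_remove_current is a hypothetical helper name, not something shipped with Hadoop or Ambari, and it assumes bash. It names each backup CURRENT.<directory name>, which matches the file names in the listing further down, and it refuses to overwrite a backup that already exists.

```shell
#!/usr/bin/env bash
# Sketch only: back up and remove the CURRENT file of ONE leveldb directory.
# backup_and_remove_current is a hypothetical helper, not part of Hadoop/Ambari.
backup_and_remove_current() {
    local ldb_dir="$1"                          # e.g. .../leveldb-timeline-store/domain-ldb
    local backup_dir="${2:-/tmp/leveldbissue}"  # backup location used in this post
    local name
    name="$(basename "$ldb_dir")"

    if [ ! -f "$ldb_dir/CURRENT" ]; then
        echo "no CURRENT file in $ldb_dir" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1

    # Never clobber an earlier backup of the same directory.
    if [ -e "$backup_dir/CURRENT.$name" ]; then
        echo "backup already exists: $backup_dir/CURRENT.$name" >&2
        return 1
    fi

    # Copy first (preserving owner/mode/timestamps), remove only if the copy worked.
    cp -p "$ldb_dir/CURRENT" "$backup_dir/CURRENT.$name" || return 1
    rm "$ldb_dir/CURRENT"
}
```

With the function sourced into a shell on the ATS host, a call would look like: backup_and_remove_current /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb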
Each time a corrupted leveldb directory is reported, do the above, restart the ATS (via Ambari), and repeat until no more xxxx-ldb directories are reported with 'Corruption' in the log.
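To find which directory to treat on each iteration, the failing path can be pulled out of the log rather than read by eye. A rough sketch, assuming the message format shown in the log excerpt above ("Corruption: ... e.g.: <path>.sst") and paths without spaces. corrupt_ldb_dirs is a hypothetical helper name, and the log file location is not given in this post, so the function simply reads the log from standard input.

```shell
#!/usr/bin/env bash
# Sketch only: print the leveldb directories named in "Corruption:" log lines.
# corrupt_ldb_dirs is a hypothetical helper; it reads log text from stdin.
corrupt_ldb_dirs() {
    grep 'Corruption:' \
        | grep -oE '/[^ ]+\.sst' \
        | sed 's|/[^/]*$||' \
        | sort -u
    # 1st grep: keep only the corruption messages
    # 2nd grep: extract the example .sst path from each message
    # sed:      strip the file name, leaving the xxxx-ldb directory
    # sort -u:  report each directory once
}
```

Feed it the ATS log after each failed start, for example: corrupt_ldb_dirs < <your ATS log file>. The log only gives one example file per failure, so expect one directory per restart attempt.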
Here are the files at the end of my iterations through 'corruptions'.
$ cd /tmp/leveldbissue
$ ls -alt CURR*
-rw-r--r-- 1 root root 16 Jun 15 20:28 CURRENT.starttime-ldb
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.timeline-state-store.ldb
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.owner-ldb
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:48 CURRENT.domain-ldb
The process was fairly painless, though the "recovery" that the ATS runs on restart after the CURRENT files are removed did take some time on the busy cluster I was working on. If downtime is more of a concern than preserving the ATS job history, you could consider clearing the ATS data instead.
Hope this helps - not a nice one to get in the small hours of the morning when you are on your own.