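The YARN Application Timeline Server (ATS) was failing to start because of leveldb corruption (log excerpt below). I didn't want to clear the ATS store and lose all the job history; I wanted to fix the corruption and preserve the ATS leveldb entries.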

2017-06-15 19:17:43,871 INFO service.AbstractService ( - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 43 missing files; e.g.: /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/000015.sst 
2017-06-15 19:17:43,871 INFO impl.MetricsSystemImpl ( - Stopping ApplicationHistoryServer metrics system... 
2017-06-15 19:17:43,873 INFO impl.MetricsSinkAdapter ( - timeline thread interrupted. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl ( - ApplicationHistoryServer metrics system stopped. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl ( - ApplicationHistoryServer metrics system shutdown complete. 
2017-06-15 19:17:43,876 FATAL applicationhistoryservice.ApplicationHistoryServer ( - Error starting ApplicationHistoryServer 
2017-06-15 19:17:43,877 INFO util.ExitUtil ( - Exiting with status -1 
2017-06-15 19:17:43,880 INFO applicationhistoryservice.ApplicationHistoryServer ( - SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer
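
The directory that needs attention is the parent of the file named after "e.g.:" in the Corruption line. As a rough sketch (the function name corrupt_ldb_dirs is made up for illustration, and it assumes the message format shown in the log above), the directory can be pulled out of the log like this:

```shell
# Print the leveldb directory named in each "Corruption: ... missing files"
# log line. Reads log text on stdin. The function name is hypothetical and
# the pattern assumes the message format shown in the log excerpt above.
corrupt_ldb_dirs() {
  # Capture everything between "e.g.: " and the last "/" before the file
  # name, allowing for trailing whitespace at the end of the line.
  sed -n '/Corruption:/s|.*e\.g\.: \(/.*\)/[^/ ]*[[:space:]]*$|\1|p' \
    | sort -u
}
```

It would be fed the ATS log on stdin, for example corrupt_ldb_dirs < <your ATS log file>. Only one directory is reported per failed start, which is why the fix has to be repeated.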

This post on HCC was helpful: ATS issue.

What wasn't obvious from that post is that there could be more than one leveldb "partition" (my term) corrupted.
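
To get a feel for how many of these "partitions" exist before starting, one option is to list every directory under the timeline store that holds a CURRENT file. This is only a sketch: list_ldb_dirs is a hypothetical helper, and the root path you pass in depends on your own configuration (in my log it was under /data/hadoop/ats/leveldb).

```shell
# List every directory under the given root that contains a leveldb CURRENT
# file, one per line, sorted. The function name is hypothetical.
list_ldb_dirs() {
  find "$1" -type f -name CURRENT -exec dirname {} \; | sort
}
```

For example: list_ldb_dirs /data/hadoop/ats/leveldb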

In my case several of them were corrupted, and for each one I had to remove the CURRENT file from the affected directory. The full set I ended up with is listed at the end of this article.

I kept copies of the CURRENT files in /tmp/leveldbissue like this:

cd <dir where the leveldb files were reported missing>
mkdir -p /tmp/leveldbissue
cp -ip CURRENT /tmp/leveldbissue/CURRENT.xxxx-ldb (where xxxx-ldb is the deepest dir where the leveldb files were reported missing)
rm CURRENT
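
Since the same copy-then-remove step has to be repeated for every corrupted directory, it can be wrapped in a small function. This is a sketch rather than a tested tool: set_aside_current is a name invented here, the default backup directory is the /tmp/leveldbissue used above, and it assumes you run it as a user allowed to modify the ATS leveldb directories.

```shell
# Back up and then remove the CURRENT file of one leveldb directory.
# Usage: set_aside_current <leveldb-dir> [backup-dir]
# The function name is hypothetical; the backup is stored as
# CURRENT.<name of the leveldb dir>, e.g. CURRENT.domain-ldb.
set_aside_current() {
  ldb_dir=$1
  backup_dir=${2:-/tmp/leveldbissue}
  backup_file=$backup_dir/CURRENT.$(basename "$ldb_dir")

  if [ ! -f "$ldb_dir/CURRENT" ]; then
    echo "no CURRENT file in $ldb_dir" >&2
    return 1
  fi
  if [ -e "$backup_file" ]; then
    # Never overwrite an earlier backup of the same directory.
    echo "backup already exists: $backup_file" >&2
    return 1
  fi

  mkdir -p "$backup_dir" || return 1
  # -p preserves mode and timestamps (and ownership where permitted).
  cp -p "$ldb_dir/CURRENT" "$backup_file" || return 1
  # Remove the original only after the copy has succeeded.
  rm "$ldb_dir/CURRENT"
}
```

For the directory in the log above this would be: set_aside_current /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb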

Each time a corrupted leveldb directory is reported, do the above, restart the ATS (via Ambari), and repeat until the ATS starts without reporting 'corruption' for any xxxx-ldb (or .ldb) directory.

Here are the files at the end of my iterations through 'corruptions'.

$ cd /tmp/leveldbissue
$ ls -alt CURR* 
-rw-r--r-- 1 root root 16 Jun 15 20:28 CURRENT.starttime-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.timeline-state-store.ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.owner-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:48 CURRENT.domain-ldb 

The process was fairly painless, though the "recovery" that the ATS runs on restart after the CURRENT files are removed did take some time on the busy cluster I was working on. If downtime is more of a concern than preserving the ATS job history, you could consider clearing the ATS data instead.

Hope this helps - not a nice one to get in the small hours of the morning when you are on your own.