
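The YARN Application Timeline Server (ATS) was failing to start with the leveldb corruption error shown in the log below. I didn't want to clear the ATS and lose all the job history; I wanted to fix the corruption and preserve the ATS leveldb entries.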

2017-06-15 19:17:43,871 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 43 missing files; e.g.: /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/000015.sst 
<snipped>
2017-06-15 19:17:43,871 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping ApplicationHistoryServer metrics system... 
2017-06-15 19:17:43,873 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - ApplicationHistoryServer metrics system stopped. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(605)) - ApplicationHistoryServer metrics system shutdown complete. 
2017-06-15 19:17:43,876 FATAL applicationhistoryservice.ApplicationHistoryServer (ApplicationHistoryServer.java:launchAppHistoryServer(171)) - Error starting ApplicationHistoryServer 
<snipped>
2017-06-15 19:17:43,877 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status -1 
2017-06-15 19:17:43,880 INFO applicationhistoryservice.ApplicationHistoryServer (LogAdapter.java:info(45)) - SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer

This post on HCC was helpful: ATS issue.

What wasn't obvious from that post is that more than one leveldb "partition" (my term) can be corrupted at the same time.

In my case, four leveldb directories were corrupted, and each one needed the same remedial steps.

I had to remove each of the following CURRENT files:

/data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/CURRENT 
/data/hadoop/ats/leveldb/leveldb-timeline-store/starttime-ldb/CURRENT 
/data/hadoop/ats/leveldb/leveldb-timeline-store/owner-ldb/CURRENT 
/data/hadoop/yarn/timeline/timeline-state-store.ldb/CURRENT 

Before removing each CURRENT file, I saved a copy of it in /tmp/leveldbissue, like this (xxxx-ldb is the deepest directory in which the leveldb files were reported missing):

cd <dir where the leveldb files were reported missing>
mkdir -p /tmp/leveldbissue
cp -ip CURRENT /tmp/leveldbissue/CURRENT.xxxx-ldb
rm CURRENT
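
The backup-and-remove steps above can be wrapped in a small helper so that each iteration is a single call. This is only a sketch of the manual procedure, not part of any Hadoop tooling: the function name backup_and_remove_current and the BACKUP_DIR variable are made up for illustration, and the script refuses to overwrite an existing backup instead of prompting the way cp -i does.

```shell
#!/bin/sh
# Sketch only: back up and remove the CURRENT file of ONE leveldb directory.
# BACKUP_DIR and backup_and_remove_current are illustrative names (assumptions).
BACKUP_DIR="${BACKUP_DIR:-/tmp/leveldbissue}"

backup_and_remove_current() {
    ldb_dir="${1%/}"                       # drop a trailing slash, if any
    name=$(basename "$ldb_dir")            # e.g. domain-ldb
    target="$BACKUP_DIR/CURRENT.$name"

    if [ ! -f "$ldb_dir/CURRENT" ]; then
        echo "no CURRENT file in $ldb_dir" >&2
        return 1
    fi
    if [ -e "$target" ]; then
        echo "backup $target already exists, not overwriting" >&2
        return 1
    fi

    mkdir -p "$BACKUP_DIR" || return 1
    # -p keeps mode and timestamps (and ownership where permitted)
    cp -p "$ldb_dir/CURRENT" "$target" || return 1
    rm "$ldb_dir/CURRENT"
}
```

For example, for the first directory in the list above: backup_and_remove_current /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb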

Each time a corrupted leveldb directory is reported, repeat the steps above and restart the ATS (via Ambari). Iterate until the log no longer reports 'Corruption' for any xxxx-ldb directory.
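
To find which directory to fix on each iteration, the affected path can be pulled out of the ATS log rather than read by eye. The sketch below assumes the corruption message always has the same shape as the one in the log excerpt above ("Corruption: N missing files; e.g.: <path>"); the function name corrupt_ldb_dirs is illustrative, and the log file location is left as an argument because it varies by installation.

```shell
#!/bin/sh
# Sketch only: print the leveldb directory named in each "Corruption:" log line.
# corrupt_ldb_dirs is an illustrative name; $1 is the path to the ATS log file.
corrupt_ldb_dirs() {
    grep 'Corruption:' "$1" \
        | sed -n 's|.*e\.g\.: *\(/[^ ]*\)/[^/ ]*.*|\1|p' \
        | sort -u                          # one line per affected directory
}
```

Run it against the ATS log after each failed restart; each path it prints is a directory whose CURRENT file needs the backup-and-remove treatment.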

Here are the backed-up CURRENT files at the end of my iterations through the 'corruptions':

$ cd /tmp/leveldbissue
$ ls -alt CURR* 
-rw-r--r-- 1 root root 16 Jun 15 20:28 CURRENT.starttime-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.timeline-state-store.ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.owner-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:48 CURRENT.domain-ldb 

The process was fairly painless, although the "recovery" that runs when the ATS restarts after the CURRENT files are removed took some time on the busy cluster I was working on. If downtime is more of a concern than preserving the ATS job history, you could consider clearing the ATS data instead.

Hope this helps - not a nice one to get in the small hours of the morning when you are on your own.
