
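The YARN Application Timeline Server (ATS) would not start because of leveldb corruption, as the log below shows. I didn't want to clear the ATS data and lose all the job history; I wanted to fix the corruption and preserve the ATS leveldb entries.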

2017-06-15 19:17:43,871 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 43 missing files; e.g.: /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/000015.sst 
<snipped>
2017-06-15 19:17:43,871 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping ApplicationHistoryServer metrics system... 
2017-06-15 19:17:43,873 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - ApplicationHistoryServer metrics system stopped. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(605)) - ApplicationHistoryServer metrics system shutdown complete. 
2017-06-15 19:17:43,876 FATAL applicationhistoryservice.ApplicationHistoryServer (ApplicationHistoryServer.java:launchAppHistoryServer(171)) - Error starting ApplicationHistoryServer 
<snipped>
2017-06-15 19:17:43,877 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status -1 
2017-06-15 19:17:43,880 INFO applicationhistoryservice.ApplicationHistoryServer (LogAdapter.java:info(45)) - SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer

This post on HCC was helpful: ATS issue.

What wasn't obvious from that post is that there could be more than one leveldb "partition" (my term) corrupted.
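To see which leveldb directory a given failure points at, the path can be pulled out of the 'Corruption' message. The following is a minimal sketch, assuming the message format shown in the log above; the function name `corrupted_ldb_dirs` and its log-file argument are hypothetical, and the location of your ATS log is something you need to supply yourself.

```shell
# Sketch only: print the leveldb directories named in 'Corruption' messages.
# Assumes lines of the form
#   "... Corruption: N missing files; e.g.: /path/to/xxxx-ldb/000015.sst"
# as seen in the log above. $1 is the path to an ATS log file (hypothetical argument).
corrupted_ldb_dirs() {
  # Capture the directory part of the .sst path that follows "Corruption:",
  # then de-duplicate, since the same directory is reported on every failed start.
  sed -n 's|.*Corruption:.* \(/[^ ]*\)/[^/ ]*\.sst.*|\1|p' "$1" | sort -u
}
```

Called as `corrupted_ldb_dirs <path to your ATS log>`, it prints one directory per line. Since each failed start in my case reported one corrupted directory at a time, expect to rerun it after every restart rather than getting the full list up front.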

In my case, four leveldb directories were corrupted, and I had to remove the CURRENT file from each of them:

/data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/CURRENT 
/data/hadoop/ats/leveldb/leveldb-timeline-store/starttime-ldb/CURRENT 
/data/hadoop/ats/leveldb/leveldb-timeline-store/owner-ldb/CURRENT 
/data/hadoop/yarn/timeline/timeline-state-store.ldb/CURRENT 

I kept a copy of each CURRENT file in /tmp/leveldbissue, named after the directory it came from, like this:

cd <dir where the leveldb files were reported missing>
mkdir -p /tmp/leveldbissue
# xxxx-ldb is the name of the deepest dir where the leveldb files were reported missing
cp -ip CURRENT /tmp/leveldbissue/CURRENT.xxxx-ldb
rm CURRENT

Each time a corrupted leveldb directory is reported, repeat the steps above, restart the ATS (via Ambari), and iterate until the log no longer reports 'Corruption' for any xxxx-ldb or .ldb directory.
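Since the same backup-and-remove step is repeated for each corrupted directory, it can be wrapped in a small function. This is a sketch rather than a tested tool: the function name `backup_and_remove_current` and its arguments are hypothetical, and it assumes the ATS is stopped (it was in my case, because it failed to start) and that you have write access to the leveldb directory.

```shell
# Sketch only: back up and remove the CURRENT file of one leveldb directory.
# $1: leveldb directory reported as corrupted (e.g. .../domain-ldb)
# $2: backup directory, defaulting to /tmp/leveldbissue as used in this article
backup_and_remove_current() {
  ldb_dir=$1
  backup_dir=${2:-/tmp/leveldbissue}
  name=$(basename "$ldb_dir")

  # Nothing to do if this directory has no CURRENT file.
  [ -f "$ldb_dir/CURRENT" ] || { echo "no CURRENT file in $ldb_dir" >&2; return 1; }

  mkdir -p "$backup_dir" || return 1

  # Refuse to overwrite a backup taken in an earlier iteration.
  [ ! -e "$backup_dir/CURRENT.$name" ] || { echo "backup already exists for $name" >&2; return 1; }

  # Copy first (preserving owner and timestamps where permitted), then remove.
  cp -p "$ldb_dir/CURRENT" "$backup_dir/CURRENT.$name" || return 1
  rm "$ldb_dir/CURRENT"
}
```

For example, `backup_and_remove_current /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb` would leave a copy at /tmp/leveldbissue/CURRENT.domain-ldb. The remove only runs if the copy succeeded, so a failed backup never costs you the original file.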

Here are the backed-up CURRENT files at the end of my iterations through the 'corruptions':

$ cd /tmp/leveldbissue
$ ls -alt CURR* 
-rw-r--r-- 1 root root 16 Jun 15 20:28 CURRENT.starttime-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.timeline-state-store.ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.owner-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:48 CURRENT.domain-ldb 

The process was fairly painless, though the "recovery" that the ATS performs on restart after the CURRENT files are removed did take some time on the busy cluster I was working on. If downtime is more of a concern than preserving the ATS job history, you could consider clearing the ATS data instead.

Hope this helps - not a nice one to get in the small hours of the morning when you are on your own.

Last update: 06-30-2017 11:09 PM