Application Timeline Server (ATS) leveldb corruption issue

gloureiro · ‎06-30-2017

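The Application Timeline Server (ATS) was failing to start with the leveldb corruption error shown below. I didn't want to clear the ATS data and lose all the job history; I wanted to fix the corruption and preserve the ATS leveldb entries.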

2017-06-15 19:17:43,871 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 43 missing files; e.g.: /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/000015.sst 
<snipped>
2017-06-15 19:17:43,871 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(211)) - Stopping ApplicationHistoryServer metrics system... 
2017-06-15 19:17:43,873 INFO impl.MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue(141)) - timeline thread interrupted. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(217)) - ApplicationHistoryServer metrics system stopped. 
2017-06-15 19:17:43,875 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(605)) - ApplicationHistoryServer metrics system shutdown complete. 
2017-06-15 19:17:43,876 FATAL applicationhistoryservice.ApplicationHistoryServer (ApplicationHistoryServer.java:launchAppHistoryServer(171)) - Error starting ApplicationHistoryServer 
<snipped>
2017-06-15 19:17:43,877 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status -1 
2017-06-15 19:17:43,880 INFO applicationhistoryservice.ApplicationHistoryServer (LogAdapter.java:info(45)) - SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down ApplicationHistoryServer
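The directory that needs attention is the parent of the example file named in the "Corruption:" line. As a minimal sketch, the helper below pulls that directory out of log text supplied on standard input. The function name `corrupt_store_dirs` is hypothetical, and the location of the ATS log file depends on your installation, so it is left as a placeholder in the usage line.

```shell
#!/bin/sh
# Print the leveldb store directory named in each "Corruption:" log line.
# Reads log text on stdin; nothing is modified on disk.
corrupt_store_dirs() {
    grep 'Corruption:' |
        # Keep the path after "e.g.:" and drop its final component
        # (the missing .sst file), leaving the store directory.
        sed -n 's|.*e\.g\.: *\([^ ]*\)/[^/ ]* *$|\1|p' |
        sort -u
}

# Usage (the log path is a placeholder, not a known location):
#   corrupt_store_dirs < <path to the ATS log file>
```

Only one store tends to be reported per failed start, so the output is usually a single directory per restart attempt.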

This post on HCC was helpful: ATS issue.

What wasn't obvious from that post is that there could be more than one leveldb "partition" (my term) corrupted.

In my case, several leveldb stores were corrupted, and each one required the remedial steps below.

I had to remove each of the following CURRENT files:

/data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb/CURRENT 
/data/hadoop/ats/leveldb/leveldb-timeline-store/starttime-ldb/CURRENT 
/data/hadoop/ats/leveldb/leveldb-timeline-store/owner-ldb/CURRENT 
/data/hadoop/yarn/timeline/timeline-state-store.ldb/CURRENT

I kept copies of the CURRENT files in /tmp/leveldbissue like this:

cd <dir where the leveldb files were reported missing>
mkdir -p /tmp/leveldbissue
cp -ip CURRENT /tmp/leveldbissue/CURRENT.xxxx-ldb (where xxxx-ldb is the deepest dir in the path where the leveldb files were reported missing)
rm CURRENT
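The backup-and-remove steps can be wrapped in a small function so that they run the same way for every store. This is a sketch rather than a tested recovery tool: the function name `backup_and_remove_current` and the `BACKUP_DIR` variable are hypothetical, and the backup is named `CURRENT.<store directory>` to match the file names in the /tmp/leveldbissue listing in this article. It refuses to overwrite an existing backup instead of prompting.

```shell
#!/bin/sh
# Back up and remove the CURRENT file of one leveldb store directory.
# BACKUP_DIR defaults to the /tmp/leveldbissue directory used in this article.
BACKUP_DIR="${BACKUP_DIR:-/tmp/leveldbissue}"

backup_and_remove_current() {
    store_dir="$1"                    # e.g. .../leveldb-timeline-store/domain-ldb
    name=$(basename "$store_dir")     # deepest directory name, e.g. domain-ldb

    if [ ! -f "$store_dir/CURRENT" ]; then
        echo "no CURRENT file in $store_dir" >&2
        return 1
    fi
    if [ -e "$BACKUP_DIR/CURRENT.$name" ]; then
        echo "backup $BACKUP_DIR/CURRENT.$name already exists" >&2
        return 1
    fi

    mkdir -p "$BACKUP_DIR" || return 1
    # -p keeps mode and timestamps (and ownership where permitted)
    cp -p "$store_dir/CURRENT" "$BACKUP_DIR/CURRENT.$name" || return 1
    rm "$store_dir/CURRENT"
}

# Usage:
#   backup_and_remove_current /data/hadoop/ats/leveldb/leveldb-timeline-store/domain-ldb
```

Because the removal only happens after the copy succeeds, a failed backup leaves the original CURRENT file in place.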

Each time a corrupted leveldb store is reported, repeat the steps above and restart the ATS (via Ambari). Iterate until the log no longer reports 'Corruption' for any xxxx-ldb or .ldb directory.

Here are the files at the end of my iterations through 'corruptions'.

$ cd /tmp/leveldbissue
$ ls -alt CURR* 
-rw-r--r-- 1 root root 16 Jun 15 20:28 CURRENT.starttime-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.timeline-state-store.ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:51 CURRENT.owner-ldb 
-rw-r--r-- 1 yarn hadoop 16 Apr 13 04:48 CURRENT.domain-ldb

The process was fairly painless, although the "recovery" that the ATS performs on restart after the CURRENT files are removed did take some time on the busy cluster I was working on. If downtime is more of a concern than preserving the ATS job history, you could consider clearing the ATS data instead.

Hope this helps - not a nice one to get in the small hours of the morning when you are on your own.
