12-11-2015
02:11 AM
Hi, I'm using HDP 2.3.0 and the YARN app timeline server is failing periodically. According to the app timeline server log, the cause is "GC overhead limit exceeded":
2015-12-02 12:48:56,548 ERROR mortbay.log (Slf4jLog.java:warn(87)) - /ws/v1/timeline/spark_event_v01
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.codehaus.jackson.util.TextBuffer.contentsAsString(TextBuffer.java:350)
at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:278)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:204)
at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
at org.codehaus.jackson.map.ObjectReader._bindAndClose(ObjectReader.java:768)
at org.codehaus.jackson.map.ObjectReader.readValue(ObjectReader.java:486)
at org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:93)
at org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:77)
at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntityEvent(LeveldbTimelineStore.java:1188)
at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntity(LeveldbTimelineStore.java:437)
at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntityByTime(LeveldbTimelineStore.java:685)
at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntities(LeveldbTimelineStore.java:557)
at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.getEntities(TimelineDataManager.java:134)
at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
It seems that the timeline server fails to delete old LevelDB data, so each time it has to load a large volume of old entries, which causes the GC overhead. The log contains many lines like the following:
2015-12-02 12:48:14,471 WARN timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:deleteNextEntity(1459)) - Found no start time for reverse related entity tez_appattempt_1447379225800_23982_000001 of type TEZ_APPLICATION_ATTEMPT while deleting dag_1447379225800_23982_1 of type TEZ_DAG_ID
2015-12-02 12:48:14,471 WARN timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:deleteNextEntity(1459)) - Found no start time for reverse related entity tez_appattempt_1447379225800_23982_000001 of type TEZ_APPLICATION_ATTEMPT while deleting dag_1447379225800_23982_1 of type TEZ_DAG_ID
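To get a feel for how widespread these warnings are, the log lines could be tallied per entity type. The following is only a rough sketch: the function name `count_missing_start_time` and the grouping by type pair are illustrative assumptions, not part of any Hadoop tooling, and the pattern is derived solely from the WARN message shown above.

```python
import re
from collections import Counter

# Pattern built from the WARN message quoted above; other log formats
# are not covered.
WARN_RE = re.compile(
    r"Found no start time for reverse related entity (?P<related>\S+) "
    r"of type (?P<related_type>\S+) while deleting (?P<entity>\S+) "
    r"of type (?P<entity_type>\S+)"
)


def count_missing_start_time(lines):
    """Count matching warnings per (related type, deleted entity type)."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            key = (match.group("related_type"), match.group("entity_type"))
            counts[key] += 1
    return counts
```

It can be fed any iterable of lines, for example an open file handle on the timeline server log (the log path depends on the installation).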
And checking the volume of the timeline data folder gives the following info:
40K timeline/timeline-state-store.ldb
7.0G timeline/leveldb-timeline-store.ldb
7.0G timeline
3.4G timeline-data/leveldb-timeline-store.ldb
3.4G timeline-data
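For tracking whether the store keeps growing over time, the same size check could be scripted. This is a minimal sketch under stated assumptions: `dir_size_bytes` is a hypothetical helper, and it sums apparent file sizes, so the result may differ somewhat from what `du` reports in disk blocks.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file may have been removed (e.g. by a LevelDB
                # compaction) between listing and stat; skip it.
                pass
    return total
```

Calling it on a directory such as `timeline/leveldb-timeline-store.ldb` at intervals would show whether old entries are ever being removed.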
With the app timeline server failing, I currently cannot see the history of my Spark jobs. Any help is appreciated.