Support Questions

Yarn timeline server periodically fails


Hi,

I'm using HDP 2.3.0, and the YARN Application Timeline Server (ATS) fails periodically. Checking the timeline server log, the cause is "GC overhead limit exceeded":

2015-12-02 12:48:56,548 ERROR mortbay.log (Slf4jLog.java:warn(87)) - /ws/v1/timeline/spark_event_v01
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.codehaus.jackson.util.TextBuffer.contentsAsString(TextBuffer.java:350)
        at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:278)
        at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
        at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:204)
        at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
        at org.codehaus.jackson.map.ObjectReader._bindAndClose(ObjectReader.java:768)
        at org.codehaus.jackson.map.ObjectReader.readValue(ObjectReader.java:486)
        at org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:93)
        at org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:77)
        at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntityEvent(LeveldbTimelineStore.java:1188)
        at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntity(LeveldbTimelineStore.java:437)
        at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntityByTime(LeveldbTimelineStore.java:685)
        at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntities(LeveldbTimelineStore.java:557)
        at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.getEntities(TimelineDataManager.java:134)
        at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:119)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)

It seems that the timeline server fails to delete old LevelDB data, so every time it starts it must load a large volume of old entries, which causes the GC overhead. The log contains many lines like the following:

2015-12-02 12:48:14,471 WARN  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:deleteNextEntity(1459)) - Found no start time for reverse related entity tez_appattempt_1447379225800_23982_000001 of type TEZ_APPLICATION_ATTEMPT while deleting dag_1447379225800_23982_1 of type TEZ_DAG_ID
2015-12-02 12:48:14,471 WARN  timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:deleteNextEntity(1459)) - Found no start time for reverse related entity tez_appattempt_1447379225800_23982_000001 of type TEZ_APPLICATION_ATTEMPT while deleting dag_1447379225800_23982_1 of type TEZ_DAG_ID

Checking the size of the timeline data folders gives the following:

40K     timeline/timeline-state-store.ldb
7.0G    timeline/leveldb-timeline-store.ldb
7.0G    timeline
3.4G    timeline-data/leveldb-timeline-store.ldb
3.4G    timeline-data
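For reference, the cleanup that appears to be failing here is governed by the ATS time-to-live settings. A sketch of the relevant yarn-site.xml properties (names from the Hadoop 2.x timeline server documentation; the values shown are the documented defaults, not recommendations):

```xml
<!-- yarn-site.xml: ATS retention settings (Hadoop 2.x defaults shown) -->
<property>
  <!-- Whether old timeline entities are purged at all -->
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <!-- How long entities are retained: 7 days by default -->
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>
<property>
  <!-- How often the LevelDB store runs its deletion cycle: 5 minutes -->
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>300000</value>
</property>
```

Lowering the TTL can shrink the store over time, but it does not explain the "Found no start time for reverse related entity" warnings above, which suggest the deletion cycle itself is tripping on inconsistent entries.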

With app timeline server failing, currently I cannot see the history of my Spark Jobs. Any help is appreciated.

1 ACCEPTED SOLUTION


This looks like it's being triggered by the Spark -> timeline server integration, as ATS is going OOM when handling spark events.

Which means it's my code running in the Spark jobs that is triggering this.

What kind of jobs are you running? Short lived? Long-lived? Many executors?

The best short-term fix is for you to disable the timeline server integration, and set the spark applications up to log to HDFS instead, with the history server reading it from there.

The details of this are covered in Spark Monitoring

1. In the spark job configuration you need to disable the ATS publishing.

Find the line

spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService

and delete it.

Then set the property spark.history.fs.logDirectory to an HDFS directory that is writeable by everyone (for example, hdfs://shared/logfiles) and enable event logging:

spark.eventLog.enabled true
spark.eventLog.compress true
spark.history.fs.logDirectory hdfs://shared/logfiles

2. In the history server you need to switch to the filesystem log provider

spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://shared/logfiles
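To satisfy the "writeable by everyone" requirement, the shared log directory can be created ahead of time. A sketch using the example path from this thread (substitute your own path; the sticky bit is an optional hardening choice, like /tmp, so users cannot delete each other's logs):

```shell
# Create the shared Spark event-log directory on HDFS
# (/shared/logfiles is the example path from this thread).
hadoop fs -mkdir -p /shared/logfiles
# Make it world-writeable with the sticky bit set.
hadoop fs -chmod 1777 /shared/logfiles
```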

The next Spark release we'll have up for download (soon!) will log fewer events to the timeline server, which should reduce the load on it. There's also a lot of work going on in the timeline server for future Hadoop versions to handle larger amounts of data, by keeping some of it in HDFS alongside the LevelDB data.

For now, switching to the filesystem provider is your best bet.


4 REPLIES

@Linh Tran

Please check memory utilization while running the operations.

java.lang.OutOfMemoryError: GC overhead limit exceeded

http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
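If you stay on ATS in the short term, the immediate OOM can sometimes be pushed back by giving the timeline server a larger heap. A sketch via yarn-env.sh, assuming the standard Hadoop 2.x heap-size hook for the timeline server daemon (4096 is only an illustrative value; pick one that fits your node):

```shell
# yarn-env.sh: raise the Application Timeline Server heap (value in MB).
# YARN_TIMELINESERVER_HEAPSIZE is read by the Hadoop 2.x `yarn` launcher
# when starting the timelineserver daemon; restart ATS after changing it.
export YARN_TIMELINESERVER_HEAPSIZE=4096
```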


Can I add that there's now a preview of Spark 1.6 on HDP: this one shouldn't overload the timeline server.

Further details on this.

Configuring the Spark History Server to Use HDFS

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/config-...