Created 12-11-2015 02:11 AM
Hi,
I'm using HDP 2.3.0 and the YARN Application Timeline Server is failing periodically. Checking the timeline server log, the cause is a GC overhead limit exceeded error:
2015-12-02 12:48:56,548 ERROR mortbay.log (Slf4jLog.java:warn(87)) - /ws/v1/timeline/spark_event_v01
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.codehaus.jackson.util.TextBuffer.contentsAsString(TextBuffer.java:350)
    at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:278)
    at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:59)
    at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.mapObject(UntypedObjectDeserializer.java:204)
    at org.codehaus.jackson.map.deser.std.UntypedObjectDeserializer.deserialize(UntypedObjectDeserializer.java:47)
    at org.codehaus.jackson.map.ObjectReader._bindAndClose(ObjectReader.java:768)
    at org.codehaus.jackson.map.ObjectReader.readValue(ObjectReader.java:486)
    at org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:93)
    at org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:77)
    at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntityEvent(LeveldbTimelineStore.java:1188)
    at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntity(LeveldbTimelineStore.java:437)
    at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntityByTime(LeveldbTimelineStore.java:685)
    at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntities(LeveldbTimelineStore.java:557)
    at org.apache.hadoop.yarn.server.timeline.TimelineDataManager.getEntities(TimelineDataManager.java:134)
    at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
It seems that the timeline server fails to delete old leveldb data, so every time it must load a large volume of old entries, which causes the GC overhead. Checking the log, there are many lines like the following:
2015-12-02 12:48:14,471 WARN timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:deleteNextEntity(1459)) - Found no start time for reverse related entity tez_appattempt_1447379225800_23982_000001 of type TEZ_APPLICATION_ATTEMPT while deleting dag_1447379225800_23982_1 of type TEZ_DAG_ID
2015-12-02 12:48:14,471 WARN timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:deleteNextEntity(1459)) - Found no start time for reverse related entity tez_appattempt_1447379225800_23982_000001 of type TEZ_APPLICATION_ATTEMPT while deleting dag_1447379225800_23982_1 of type TEZ_DAG_ID
And checking the size of the timeline data folders gives the following:
40K   timeline/timeline-state-store.ldb
7.0G  timeline/leveldb-timeline-store.ldb
7.0G  timeline
3.4G  timeline-data/leveldb-timeline-store.ldb
3.4G  timeline-data
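For reference, these are the yarn-site.xml properties that, as far as I understand, control the leveldb retention and cleanup. The values below are just the Hadoop defaults as I read them, not necessarily what my cluster uses:
yarn.timeline-service.ttl-enable                              true
yarn.timeline-service.ttl-ms                                  604800000   (entity time-to-live, 7 days)
yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms  300000      (deletion cycle interval, 5 minutes)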
With the app timeline server failing, I currently cannot see the history of my Spark jobs. Any help is appreciated.
Created 12-12-2015 02:01 PM
This looks like it's being triggered by the Spark -> timeline server integration, as ATS is going OOM when handling Spark events.
Which means it's my code running in the Spark jobs that's triggering this.
What kind of jobs are you running? Short-lived? Long-lived? Many executors?
The best short-term fix is to disable the timeline server integration and set the Spark applications up to log to HDFS instead, with the history server reading from there.
The details of this are covered in Spark Monitoring.
1. In the Spark job configuration you need to disable the ATS publishing. Find the line
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
and delete it. Then set the property spark.history.fs.logDirectory to an HDFS directory that is writable by everyone, for example hdfs://shared/logfiles, and enable event logging (a command-line sketch follows after step 2):
spark.eventLog.enabled true
spark.eventLog.compress true
spark.history.fs.logDirectory hdfs://shared/logfiles
2. In the history server you need to switch to the filesystem log provider:
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://shared/logfiles
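For step 1, the same application-side settings can also be passed per job on the command line instead of spark-defaults.conf. A minimal sketch using the example directory from above (the class name and jar are placeholders); note that stock Spark writes its event logs to the path in spark.eventLog.dir, so I'm assuming that should point at the same directory:
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.compress=true \
  --conf spark.eventLog.dir=hdfs://shared/logfiles \
  --class com.example.MyApp \
  my-app.jar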
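For step 2, a rough sketch of the supporting commands, assuming the example path above lives at /shared/logfiles on your default filesystem (adjust for your cluster):
hdfs dfs -mkdir -p /shared/logfiles
hdfs dfs -chmod 1777 /shared/logfiles    # world-writable (with sticky bit) so every application can write its logs
# then restart the Spark history server, e.g. via Ambari, so it picks up the new provider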
The next Spark release we'll have up for download (soon!) will log fewer events to the timeline server, which should reduce the load there. There's also a lot of work going on in the timeline server for future Hadoop versions to handle larger amounts of data by mixing data kept in HDFS with the leveldb data.
For now, switching to the filesystem provider is your best bet.
Created 12-12-2015 02:53 AM
Please check memory utilization while running the operations.
java.lang.OutOfMemoryError: GC overhead limit exceeded
http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded
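If the timeline server heap is simply too small for the data it is holding, increasing it can buy some headroom. This is a sketch only: on HDP the heap is normally set through Ambari's AppTimelineServer heap setting, and in a plain Hadoop install I believe it maps to yarn-env.sh roughly as follows (the value is an example, not a recommendation):
# in yarn-env.sh; size in MB, example value only
export YARN_TIMELINESERVER_HEAPSIZE=4096
# then restart the timeline server so the new heap size takes effect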
Created 01-08-2016 04:33 PM
Can I add that there's now a preview of Spark 1.6 on HDP; this one shouldn't overload the timeline server.
Created 03-24-2016 08:16 PM
Further details on this.