New Contributor
Posts: 6
Registered: ‎08-02-2015

Spark Streaming job submitted through Envelope ends with OutOfMemory exception

[ Edited ]

We are using Envelope to develop a Spark Streaming job. After running for a long time, the job ends with a driver OutOfMemory exception. From the heap dump we can see that most of the memory is occupied by the following instance: org$apache$spark$sql$execution$CacheManager$$cachedData, which appears to be the cache of SQL operations. Could anyone help me solve this problem? Thank you so much.
172599_after_gc.png

Cloudera Employee
Posts: 461
Registered: ‎08-11-2014

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

You're simply running out of memory. A large portion of memory is dedicated to caching data in Spark, of course, which explains why so much of it holds cached data; that's not necessarily the issue here. You may be retaining state for lots of jobs in the driver, and that is eating driver memory (it wasn't clear whether that's the heap you're showing). You can simply increase memory, or look for ways to reduce memory usage.
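For example, a rough sketch of raising the driver heap at submit time; the memory value, jar name, and config file name below are placeholders, not your actual job:

# Sketch only: give the driver a larger heap; names and values are placeholders
spark-submit \
  --master yarn \
  --driver-memory 8g \
  your-envelope-app.jar your-pipeline.conf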

New Contributor
Posts: 6
Registered: ‎08-02-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

Hello srowen, thank you for your reply. I had already tried increasing the driver memory from 4g to 7g, but the extra memory only delayed the OOM. After discussing this issue with Cloudera Support, the engineer thought the problem might be caused by a design issue in Envelope and suggested that I seek help on this board.
Cloudera Employee
Posts: 21
Registered: ‎08-26-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

For Envelope specifically, it does eagerly cache data, so if you have a streaming job with many steps then that might eventually cause this problem.

 

You can stop a step from being cached by adding "cache = false", e.g.:

 

steps {
  ...
  step_name_here {
    dependencies = ...
    cache = false
    ...
  }
  ...
}

Likely in the next version we will change the default to not cache a step unless it is configured to do so.

 

- Jeremy

 

Cloudera Employee
Posts: 461
Registered: ‎08-11-2014

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

Caching is OK in the sense that Spark won't use more than it's allowed for caching, and you can turn that fraction down if your app uses memory heavily for other things. Have a look at spark.memory.fraction or spark.memory.storageFraction. However, that is only the issue if you're running out of memory on the executors.
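As a sketch, those fractions can be lowered at submit time; the values below are purely illustrative, not recommendations, and the jar/config names are placeholders:

# Sketch only: shrink the unified memory region and its storage share
spark-submit \
  --conf spark.memory.fraction=0.4 \
  --conf spark.memory.storageFraction=0.3 \
  your-envelope-app.jar your-pipeline.conf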

 

If you're running out of driver memory, try retaining far fewer job history details. Turn spark.ui.retained{Jobs,Stages,Tasks} way down to reduce that memory consumption.
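For instance, a sketch of passing those settings at submit time; the numbers are only illustrative (pick values that suit your job), and the jar/config names are placeholders:

# Sketch only: keep far fewer completed jobs/stages/tasks in the driver's UI state
spark-submit \
  --conf spark.ui.retainedJobs=100 \
  --conf spark.ui.retainedStages=100 \
  --conf spark.ui.retainedTasks=1000 \
  your-envelope-app.jar your-pipeline.conf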

 

But the answer may simply be that you need more memory. I don't see evidence that 7G is necessarily enough, depending on what you are doing.

New Contributor
Posts: 6
Registered: ‎08-02-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

Hello Jeremy,
Thank you for your suggestion. I will try it now. If there is any progress, I will post it here.
New Contributor
Posts: 6
Registered: ‎08-02-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

Thanks, srowen. The job simply consumes from Kafka and writes to Kudu when a specific filter condition is met.
I think the suggestion to turn down spark.ui.retained{Jobs,Stages,Tasks} will help. I will try it later.
New Contributor
Posts: 6
Registered: ‎08-02-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

Hello Jeremy, one more thing: after adding 'cache = false' on a step, how can I confirm the parameter has taken effect? Thank you for your help.
Cloudera Employee
Posts: 21
Registered: ‎08-26-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

If you put that on every step then you shouldn't see any entries in the Storage tab of the Spark UI for the job.

New Contributor
Posts: 6
Registered: ‎08-02-2015

Re: Spark Streaming job submitted through Envelope ends with OutOfMemory exception

Thank you, Jeremy. After setting "cache = false" on each step, I can still see RDD entries on the Storage tab of the Spark UI for the job. I set the parameter in the following way:

steps {
  step1_input {
    cache = false
    .....
  }
  step2_load {
    dependencies = [step1_input]
    cache = fasle
    ......
  }
}

Is that correct?