I have a Spark Streaming process that reads data from a small subset of tables in HBase and writes the data out to a different set of tables. The batch window is 120 seconds long. When I start the process, the processing times are great - 25-30 seconds - and they stay in that range for about seven hours. Then the processing times suddenly shoot up to 3.5 minutes and hold steady there for at least 30 minutes (I haven't let it run longer). If I stop the process and rerun it, the processing times fall back into that same 25-30 second window.
Where do I look to debug this?
I've checked the Spark UI logs, but none of the executors are showing exceptions when the run profile changes. I'm not seeing compactions on HBase during that window. The process also resumes acceptable run times when I restart it. I don't think it has anything to do with memory or space as it continues to run and the processing time spike is sudden and then sustained.
Which versions of Spark / HBase are you using?
How long was the gap between stopping the process and restarting it?
Are you able to capture a stack trace of an executor when the spike is observed?
We are using Spark 1.4.1 with HDP 2.3.2. The duration between the time I stopped the process and when I restarted it was within the same minute or so (it takes several seconds between the kill being issued and the process ending). I'm not seeing anything in the executor logs off of the Spark UI.
To determine the source of the delay (Spark or HBase), when the spike happens you can use a separate program to measure the read / write latencies against the same set of tables.
If those latencies are on par with what you observed at the beginning of your process, we can rule out HBase as the source of the spike.
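A minimal sketch of such a probe, assuming the HBase 1.x client API that ships with HDP 2.3.x; the table name and row key below are placeholders for a table and row your job actually touches:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLatencyProbe {
    public static void main(String[] args) throws Exception {
        Configuration hbaseConf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(hbaseConf);
             // Placeholder table name - point this at a table the streaming job reads.
             Table table = connection.getTable(TableName.valueOf("change_data_capture"))) {
            for (int i = 0; i < 100; i++) {
                long start = System.nanoTime();
                table.get(new Get(Bytes.toBytes("some-row-key"))); // placeholder row key
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("get #" + i + " took " + elapsedMs + " ms");
            }
        }
    }
}

Run it once while the job is in the 25-30 second range and again during the spike; the same idea with a Put measures write latency.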
Have you enabled backpressure? If not, can you try it? Maybe Spark is not able to process the batch within the time window.
Has the same streaming job worked fine in the past, or is this a new deployment? How about doing JVM profiling of the Spark executors to see what exactly is happening?
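For reference, backpressure is a single configuration flag, but it was only added in Spark 1.5, so this is only a sketch of what enabling it would look like after an upgrade (the app name is a placeholder):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("hbase-streaming") // placeholder
    .set("spark.streaming.backpressure.enabled", "true"); // available from Spark 1.5 onward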
We are using Spark 1.4.1 - so no backpressure. This is a new deployment. I have not done a JVM profile of the executors. I'm not exactly sure how to do that.
Here are some details on profiling.
Several external tools can be used to help profile the performance of Spark jobs:
jstack for providing stack traces,
jmap for creating heap dumps,
jstat for reporting time-series statistics, and
jconsole for visually exploring various JVM properties
are useful for those comfortable with JVM internals.
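These tools are normally attached to the executor's process id on the worker node. If attaching externally is awkward, here is a rough in-process alternative (purely a sketch) that logs a snapshot of every live thread in the current JVM, similar in spirit to what jstack reports:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpLogger {
    // Prints a snapshot of all live threads in this JVM to stderr.
    // Note: ThreadInfo.toString() may truncate very deep stacks.
    public static void dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.err.print(info);
        }
    }
}

Calling ThreadDumpLogger.dump() from inside a task (or on a timer) while the spike is happening should show where the executors are spending their time.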
Have you checked the Spark History Server and the Spark logs? I am guessing a memory leak.
What time is your StreamingContext? Every 2 minutes as well?
Also, how big are the HBase tables? Are you grabbing the entire data set?
This might be a job for NiFi.
I am not seeing any messages about the process being Out of Memory. I'm not sure what you mean by "time is your Streaming Context". The context is defined as below:
final JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(STREAM_DURATION_IN_SECS));
The HBase tables are relatively big, but I'm not ingesting whole tables; rather, I'm scanning a custom change data capture table for the latest entries and then getting the data out of the tables that were updated.
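For what it's worth, a rough sketch of that kind of change-capture read; the table name, the use of a timestamp range, and the 120-second window are assumptions for illustration, not a description of your actual code:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ChangeCaptureReader {
    // Scans only the cells written in roughly the last batch window,
    // rather than reading the whole table.
    static void readLatestChanges(Connection connection) throws IOException {
        long sinceMillis = System.currentTimeMillis() - 120_000L; // assumed batch window
        Scan scan = new Scan();
        scan.setTimeRange(sinceMillis, Long.MAX_VALUE);
        try (Table cdc = connection.getTable(TableName.valueOf("change_data_capture")); // placeholder name
             ResultScanner scanner = cdc.getScanner(scan)) {
            for (Result row : scanner) {
                // The row keys / cells here identify which source tables and rows changed.
            }
        }
    }
}

If a scan like this is the part that slows down over time, comparing how much data it has to walk early versus late in the run would be a useful next check.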
A memory leak doesn't always show an OOME (OutOfMemoryError). What could be happening is that you are having a lot of GC pauses, which cause application performance issues. I would suggest a quick peek at how the GC is behaving. Here is one way to do that.
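A minimal sketch (the flags assume the Java 7/8-era HotSpot JVMs that ship with HDP 2.3.x): turn on verbose GC logging for the executors and then read the GC lines in each executor's stdout from the Spark UI.

import org.apache.spark.SparkConf;

// Verbose GC logging for every executor JVM; these flags can also be set
// via spark-submit or spark-defaults.conf if you prefer.
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps");

If full GCs become long or frequent right around the seven-hour mark, that points at heap pressure in the executors; jstat -gcutil against an executor's process id on a worker node gives the same picture live.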
I hope this is useful.