
Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume


New Contributor

I have a Spark Streaming process that reads data from a small subset of tables in HBase and writes the data out to a different set of tables. The batch window is 120 seconds. When I start the process, the processing times are great, 25-30 seconds, and they stay in that range for about seven hours. Then the processing times shoot up to 3.5 minutes and hold steady there for at least 30 minutes (I haven't let it run longer). If I stop the process and rerun it, the processing times fall back into that same 25-30 second range.

Where do I look to debug this?

I've checked the Spark UI logs, but none of the executors show exceptions when the run profile changes. I'm not seeing HBase compactions during that window, and the process resumes acceptable run times when I restart it. I don't think it has anything to do with memory or disk space, since the job keeps running and the processing time spike is sudden and then sustained.

9 REPLIES

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

Super Collaborator

Which versions of Spark / HBase are you using?

How long was the gap between stopping the process and restarting it?

Are you able to capture a stack trace of an executor when the spike is observed?

Thanks

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

New Contributor

We are using Spark 1.4.1 with HDP 2.3.2. The duration between the time I stopped the process and when I restarted it was within the same minute or so (it takes several seconds between the kill being issued and the process ending). I'm not seeing anything in the executor logs off of the Spark UI.

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

Super Collaborator

To determine the source of the delay (Spark or HBase), you can use a separate program to measure read/write latencies against the same set of tables while the spike is happening.

If the latencies are on par with what you observe right after the process starts, we can rule out HBase as the source of the spike.
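For example, a standalone probe along these lines could be run during the slow phase (this is only a sketch: the table names, column family, and row keys are placeholders; point it at the actual tables your job reads and writes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLatencyProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table readTable = connection.getTable(TableName.valueOf("source_table"));   // placeholder
             Table writeTable = connection.getTable(TableName.valueOf("target_table"))) { // placeholder

            // Time a representative read against one of the tables the job scans
            long start = System.nanoTime();
            readTable.get(new Get(Bytes.toBytes("known-row-key")));
            long getMs = (System.nanoTime() - start) / 1000000;

            // Time a representative write against one of the tables the job writes
            Put put = new Put(Bytes.toBytes("probe-row"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("probe"));
            start = System.nanoTime();
            writeTable.put(put);
            long putMs = (System.nanoTime() - start) / 1000000;

            System.out.println("get latency (ms): " + getMs + ", put latency (ms): " + putMs);
        }
    }
}

If the get/put timings stay in the same range as during the first seven hours, the region servers are likely fine and the slowdown is on the Spark side.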

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

@Adam Doyle

Did you enable backpressure? If not, can you try it? Maybe Spark is not able to process the batch within the time window.

spark.streaming.backpressure.enabled=true
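(Note: this property only takes effect on Spark 1.5 and later. For reference, a minimal sketch of setting it in code; the app name is just a placeholder.)

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BackpressureExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("hbase-streaming-job") // placeholder app name
                // lets the ingestion rate adapt when batches start taking longer than the window
                .set("spark.streaming.backpressure.enabled", "true");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(120));
        // ... define the HBase read/write pipeline, then jsc.start(); jsc.awaitTermination();
    }
}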

Has this same streaming job worked fine in the past, or is this a new deployment? How about doing JVM profiling of the Spark executors to see what exactly is happening?

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

New Contributor

We are using Spark 1.4.1 - so no backpressure. This is a new deployment. I have not done a JVM profile of the executors. I'm not exactly sure how to do that.

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

Here are the details on profiling.

From: http://spark.apache.org/docs/latest/monitoring.html

Several external tools can be used to help profile the performance of Spark jobs:

  • Cluster-wide monitoring tools, such as Ganglia, can provide insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia dashboard can quickly reveal whether a particular workload is disk bound, network bound, or CPU bound.
  • OS profiling tools such as dstat, iostat, and iotop can provide fine-grained profiling on individual nodes.
  • JVM utilities such as jstack for providing stack traces, jmap for creating heap-dumps, jstat for reporting time-series statistics, and jconsole for visually exploring various JVM properties are useful for those comfortable with JVM internals.

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

Super Guru

Have you checked the Spark History Server and the Spark logs? I am guessing a memory leak.

What time is your StreamingContext? Every 2 minutes as well?

Also, how big are the HBase tables? Are you grabbing the entire data set?

This might be a job for NiFi.

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

New Contributor

I am not seeing any messages about the process being Out of Memory. I'm not sure what you mean by "time is your Streaming Context". The context is defined as below:

final JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(STREAM_DURATION_IN_SECS));

The HBase tables are relatively big, but I'm not ingesting whole tables; rather, I'm scanning a custom change data capture table for the latest entries and then getting the data out of the tables that were updated.
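Roughly, the per-batch lookup works like the sketch below (simplified: the table name and the time-range filter are stand-ins for the custom CDC layout):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ChangeCaptureScan {
    // Returns the row keys recorded in the CDC table between the last batch and now.
    // "change_capture" is a placeholder for the custom change data capture table.
    static List<byte[]> changedRows(Connection connection, long lastBatchTs, long nowTs)
            throws IOException {
        List<byte[]> rows = new ArrayList<byte[]>();
        try (Table cdc = connection.getTable(TableName.valueOf("change_capture"))) {
            Scan scan = new Scan();
            scan.setTimeRange(lastBatchTs, nowTs); // only cells written during the last window
            try (ResultScanner scanner = cdc.getScanner(scan)) {
                for (Result r : scanner) {
                    rows.add(r.getRow());
                }
            }
        }
        return rows;
    }
}

The batch then does targeted gets/scans against only the tables referenced by those entries, not full table scans.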

Re: Spark Streaming/HBase: Micro Batch times increase substantially without any change in processing volume

New Contributor

Adam,

A memory leak doesn't always show an OOME (OutOfMemoryError). What could be happening is that you are having a lot of GC pauses, which cause application performance issues. I would suggest a quick peek at how the GC is behaving. Here is what you should do.

  1. Find the PID: ps -eaf | grep java (or spark). Get the Linux process id of the Spark process.
  2. Monitor the heap: jstat -gcutil <pid> 2s will print GC stats every 2 seconds.
    • If jstat says it can't find the process id, the process was likely started by a different user than the one you are running jstat with, so run sudo -s jstat -gcutil <pid> 2s instead.
    • You may want to pipe the output to a file so you can review it over time, e.g. nohup jstat -gcutil <pid> 10s > gc.log & (you don't need a 2-second interval when writing to a file).
  3. Review the output: look at the "O" column (old gen, which is where long-lived objects take up residence), the "FGC" column (the number of full GC pauses, i.e. stop-the-world collections), and "FGCT" (the total time the JVM has been paused during full GCs).
    • What you are looking for is the "O" column rising toward 75-100%. Once it gets that high, you will see FGC increase by one and FGCT increase by the time it took to clean up objects (pause the world).
    • If you see long pauses (> 5 seconds) and the "O" column is cleaned up as described above, then GC tuning is required.
    • If you see back-to-back full GCs where the collector keeps trying to clean up but the "O" column doesn't decrease much, those are signs of a memory leak.
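If it's easier to instrument the job than to shell into every node with jstat, here is a rough sketch (just an illustration on my part; the class name and sample interval are arbitrary) that logs the same kind of counters in-process through the standard GarbageCollectorMXBean API:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: run this from a background thread inside the driver or an executor
// to log cumulative GC counts and pause times alongside your batch metrics.
public class GcStatsLogger implements Runnable {
    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // getCollectionCount()/getCollectionTime() are cumulative since JVM start
                    System.out.printf("GC %s: count=%d totalPauseMs=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(10000); // sample every 10 seconds, similar to the jstat interval above
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Kicked off with new Thread(new GcStatsLogger()).start() when the JVM comes up, it leaves a simple GC trace in the application logs that you can line up against your batch times.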
If you have a memory leak, that is another dissertation; contact me and I will help you solve it.

I hope this is useful.

    Eric