
What is the expected behavior of the spark.streaming.backpressure.pid.minRate property?


I have a Spark Streaming application that reads messages from Kafka using the direct stream (not receiver-based) approach and processes messages per partition in YARN cluster mode.

 

In my Kafka partitions, some batches of 2000 messages take 20 seconds to process, while other batches of the same size take only 7-9 seconds.

 

Given this fluctuation, we turned on backpressure with the following settings.

 

spark.batch.duration=10 seconds
spark.streaming.kafka.maxRatePerPartition=200

spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.initialRate=60
spark.streaming.kafka.maxRatePerPartition=200
spark.streaming.backpressure.pid.minRate=1600

We also specified the rate estimator with the following parameters. I don't understand the mathematics of PID, but I tried different combinations; one of them is as follows.

 

spark.streaming.backpressure.rateEstimator=pid
spark.streaming.backpressure.pid.minRate=1600
spark.streaming.backpressure.pid.integral=1
spark.streaming.backpressure.pid.proportional=25
spark.streaming.backpressure.pid.derived=1

Initially, Spark reads 2000 messages for one partition into the RDD, but after some time it starts reading 800 records, which I think is minRate/2, and then it stays static. In the logs, it always prints 1600 as the new rate.

2017-01-20 14:55:14 TRACE PIDRateEstimator:67 - New rate = 1600.0
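To make the constant "New rate = 1600.0" easier to reason about, here is a minimal sketch of a PID-style rate update with a floor. It is an assumption-laden illustration, not Spark's code: the function name `pid_new_rate`, its parameters, and the exact formula are hypothetical and only loosely modeled on what a PID estimator of this kind typically computes, so check the PIDRateEstimator source for your Spark version before relying on it.

```python
def pid_new_rate(latest_rate, latest_error, num_elements, processing_delay_ms,
                 scheduling_delay_ms, delay_since_update_s, batch_interval_ms,
                 proportional, integral, derived, min_rate):
    """Hypothetical PID-style update; returns (new_rate, error).

    Rates are in records per second. The formula is an assumed
    approximation of a PID rate estimator, not a copy of Spark's.
    """
    # Observed throughput of the batch that just finished.
    processing_rate = num_elements * 1000.0 / processing_delay_ms
    # Proportional term: how far the previous rate overshot the throughput.
    error = latest_rate - processing_rate
    # Integral term: backlog implied by the scheduling delay.
    historical_error = scheduling_delay_ms * processing_rate / batch_interval_ms
    # Derivative term: how quickly the error is changing.
    d_error = (error - latest_error) / delay_since_update_s
    new_rate = (latest_rate
                - proportional * error
                - integral * historical_error
                - derived * d_error)
    # The floor: the estimate never drops below min_rate.
    return max(new_rate, min_rate), error


# Slow batch with the settings from this post: 2000 records in 20 s,
# previous rate 200 records/s (maxRatePerPartition for one partition).
slow_rate, slow_error = pid_new_rate(
    latest_rate=200.0, latest_error=0.0, num_elements=2000,
    processing_delay_ms=20000, scheduling_delay_ms=0,
    delay_since_update_s=10.0, batch_interval_ms=10000,
    proportional=25.0, integral=1.0, derived=1.0, min_rate=1600.0)

# Fast batch afterwards: 800 records in 4 s, previous rate 1600 records/s.
fast_rate, _ = pid_new_rate(
    latest_rate=slow_rate, latest_error=slow_error, num_elements=800,
    processing_delay_ms=4000, scheduling_delay_ms=0,
    delay_since_update_s=10.0, batch_interval_ms=10000,
    proportional=25.0, integral=1.0, derived=1.0, min_rate=1600.0)

print(slow_rate, fast_rate)
```

Under these assumptions, a proportional gain of 25 pushes the raw estimate far below zero in both the slow and the fast case, so the result is clamped to the floor of 1600 each time. That would be consistent with the log line above never changing. With gentler gains (for example proportional=1, integral=0.2, derived=0 and a low floor) the same sketch lets the rate fall after a slow batch and climb again after a fast one.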

Given my scenario, I have a few questions:

  1. Is spark.streaming.backpressure.pid.minRate applied per partition, or is it the total number of messages to be read per batch?
  2. Why does it read 800 messages instead of the 1600 defined in spark.streaming.backpressure.pid.minRate?
  3. Are there any suggested parameters that reduce the input rate when processing takes long and raise it back to something close to maxRatePerPartition when processing is fast? In my example, the input started at 2000 messages, and when processing took about 20 seconds on average, it dropped to 800. But when those 800 messages were processed in 3-4 seconds, it did not go back up to 1600 or more. This wastes time and results in low throughput.
