Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark Streaming application shutdown when Kafka brokers are restarted

Spark Streaming application shutdown when Kafka brokers are restarted

New Contributor

I have spark streaming application that uses direct streaming listening to KAFKA topic.

    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3");
    kafkaParams.put("auto.offset.reset", "largest");

    HashSet<String> topicsSet = new HashSet<String>();
    topicsSet.add("Topic1");

    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
            jssc, 
            String.class, 
            String.class,
            StringDecoder.class, 
            StringDecoder.class, 
            kafkaParams, 
            topicsSet
    );

I notice when i stop/shutdown kafka brokers, my spark application also shutdown.

Here is the spark execution script

spark-submit \
--master yarn-cluster \
--files /home/siddiquf/spark/log4j-spark.xml
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
--class com.example.MyDataStreamProcessor \
myapp.jar 

Spark job submitted successfully and i can track the application driver and worker/executor nodes.

Everything works fine but only concern if kafka borkers are offline or restarted my application controlled by yarn should not shutdown? but it does.

If this is expected behavior then how to handle such situation with least maintenance? Keeping in mind Kafka cluster is not in hadoop cluster and managed by different team that is why requires our application to be resilient enough.

Thanks

1 REPLY 1
Highlighted

Re: Spark Streaming application shutdown when Kafka brokers are restarted

Master Collaborator
If the brokers are entirely unavailable, I believe it would fail the
job, yes. The job can't read any data anyway, so it can't proceed
anyway. I think you'd just have to restart the job if your cluster
were offline completely (or try to negotiate more reliability from the
other team if that's really the problem?)
Don't have an account?
Coming from Hortonworks? Activate your account here