I have a Spark Streaming application that uses the direct stream approach to listen to a Kafka topic.
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3");
kafkaParams.put("auto.offset.reset", "largest");

HashSet<String> topicsSet = new HashSet<String>();
topicsSet.add("Topic1");

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
);
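For reference, the snippet assumes the spark-streaming-kafka (Kafka 0.8) integration, which matches the StringDecoder-based createDirectStream signature above; these are the imports it relies on:

import java.util.HashMap;
import java.util.HashSet;

import kafka.serializer.StringDecoder;

import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;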
I noticed that when I stop/shut down the Kafka brokers, my Spark application also shuts down.
Here is the Spark submission script:
spark-submit \
  --master yarn-cluster \
  --files /home/siddiquf/spark/log4j-spark.xml \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
  --class com.example.MyDataStreamProcessor \
  myapp.jar
The Spark job is submitted successfully, and I can track the application driver and the worker/executor nodes.
Everything works fine, but my concern is this: when the Kafka brokers are offline or restarted, my application (managed by YARN) should not shut down, yet it does.
If this is the expected behavior, how do I handle such a situation with the least maintenance? Keep in mind that the Kafka cluster is not part of the Hadoop cluster and is managed by a different team, which is why our application needs to be resilient enough.
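One direction I can think of is a driver-side retry loop, roughly as sketched below. This is only an idea, not code I am running; createContext() is a hypothetical helper that would build the JavaStreamingContext and the direct stream shown above. Is something like this reasonable, or is there a better way?

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MyDataStreamProcessor {

    public static void main(String[] args) throws InterruptedException {
        boolean stopped = false;
        while (!stopped) {
            JavaStreamingContext jssc = null;
            try {
                // Hypothetical helper: builds the context and the direct Kafka
                // stream exactly as in the snippet above.
                jssc = createContext();
                jssc.start();
                // awaitTermination() rethrows the error that stopped the context,
                // e.g. when the brokers become unreachable.
                jssc.awaitTermination();
                stopped = true; // clean shutdown was requested, leave the loop
            } catch (Exception e) {
                // Most likely a broker outage: tear down this context, back off,
                // and try again instead of letting the YARN application fail.
                if (jssc != null) {
                    jssc.stop(true, false);
                }
                Thread.sleep(60 * 1000);
            }
        }
    }

    private static JavaStreamingContext createContext() {
        SparkConf conf = new SparkConf().setAppName("MyDataStreamProcessor");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        // ... create the direct Kafka stream exactly as shown above ...
        return jssc;
    }
}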