
How to preserve the order of my messages in Spark-Kafka streams of RDDs


I have my messages in a Kafka topic with a single partition. I need to process them in Spark, and I am using KafkaUtils.createStream to receive them. Each message carries a "network_id", and there are probably around 10k distinct network_ids. My micro-batch streaming context uses a 1-second interval. I must process the messages in the order they arrive and must not lose that order; within every 1-second micro-batch I may receive messages for many distinct network_ids, and they still have to be processed in arrival order.
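
For concreteness, here is a minimal sketch of my setup (the ZooKeeper quorum, consumer group, and topic name are placeholders, not my real values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("NetworkMessageOrdering")
    val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches

    // Receiver-based stream from the single-partition topic.
    // ZooKeeper quorum, group id, and topic name are placeholders.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",
      "network-consumer-group",
      Map("network-topic" -> 1)  // topic -> number of receiver threads
    )

    stream.print()  // (key, value) pairs as received
    ssc.start()
    ssc.awaitTermination()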

My questions:

  1. Is there any chance that the order is lost while the stream is created from Kafka into Spark? If it can be lost there, that directly violates my requirement.
  2. Say RDD1 is the RDD created in the first 1-second batch and RDD2 is the RDD created in the second. Is there any chance that RDD2 (created later than RDD1) is executed before RDD1, or that any message in RDD2 is processed before all messages in RDD1 have finished?
  3. To avoid losing the order, should I group my messages with groupByKey(network_id) and then apply a repartition based on the count of distinct network_ids (see the sketch after this list)? If I do that, is there any chance I lose the order again?
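
To make question 3 concrete, this is roughly what I mean, continuing from the stream above (registered before ssc.start()). It is a sketch only: extractNetworkId and processMessage are hypothetical stand-ins for my real parsing and processing logic, and whether this preserves ordering is exactly what I am asking.

    // Hypothetical helpers; replace with real parsing/processing.
    def extractNetworkId(value: String): String = value.split(",")(0)
    def processMessage(networkId: String, value: String): Unit = { /* ... */ }

    stream
      .map { case (_, value) => (extractNetworkId(value), value) }
      .groupByKey()  // one Iterable of messages per network_id per 1-second batch
      .foreachRDD { rdd =>
        // the repartition by count of distinct network_ids from question 3 would go here
        rdd.foreach { case (networkId, msgs) =>
          msgs.foreach(m => processMessage(networkId, m))
        }
      }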