I have my messages in a Kafka topic with a single partition.
I need to process them in Spark, and I am using KafkaUtils.createStream to
receive the messages. Each message in Kafka carries a "network_id", and there
are probably around 10k distinct network_id's. I am creating my micro-batch
streaming context with a 1-second interval. I must process the messages in the
order they arrive, and I cannot afford to lose that order. In every 1-second
micro-batch I may receive messages for many distinct network_id's, and they too
must be processed in the order they arrive.
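Roughly, my setup looks like the following sketch (shown in PySpark; the
ZooKeeper host, topic name, and consumer group id are placeholders, and this is
a job skeleton rather than something runnable outside a Spark deployment):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="network-id-ordering")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Receiver-based stream; the topic has a single partition,
# so one receiver thread is enough.
stream = KafkaUtils.createStream(
    ssc,
    "zkhost:2181",          # ZooKeeper quorum (placeholder)
    "my-consumer-group",    # consumer group id (placeholder)
    {"my-topic": 1})        # topic -> number of receiver threads

# Each record is a (key, value) pair; the value carries the network_id.
stream.foreachRDD(lambda rdd: rdd.foreach(print))

ssc.start()
ssc.awaitTermination()
```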
Is there any chance that the order is lost while the streams are created from
Kafka into Spark? If it is, that directly violates my requirement.
Say RDD1 is the RDD created during the first one-second interval and RDD2 is
the RDD created during the second. Is there any chance that RDD2 (which is
created later than RDD1) is executed before RDD1? Or that any message in RDD2
is processed before all the messages in RDD1 have finished processing?
To avoid losing the order: should I group my messages with
groupByKey(network_id) and apply a repartition based on the count of distinct
network_id's? If I do this, is there any chance that I lose the order again?
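To make the intended semantics concrete, this is what I expect grouping by key
to do, sketched in plain Python without Spark (the network_id's and messages
here are made up):

```python
from collections import defaultdict

def group_preserving_order(batch):
    """Group (network_id, message) pairs by key while keeping
    each key's messages in their original arrival order."""
    groups = defaultdict(list)
    for network_id, msg in batch:
        groups[network_id].append(msg)
    return dict(groups)

# messages as they arrived in one micro-batch, keys interleaved
batch = [("n1", "a"), ("n2", "x"), ("n1", "b"), ("n2", "y")]
grouped = group_preserving_order(batch)
# within each key, the arrival order "a" before "b" and "x" before "y" survives
assert grouped == {"n1": ["a", "b"], "n2": ["x", "y"]}
```

My worry is whether Spark's groupByKey followed by a repartition gives me this
same per-key ordering guarantee, or whether the shuffle can reorder messages
within a key.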