Use reduceByKey instead of groupByKey. The reduceByKey version works much better on a large dataset, because Spark knows it can combine the values that share a key on each partition before shuffling the data. With groupByKey, every key-value pair is shuffled across the network first and only summed afterwards.
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
val wordCountsWithGroup = wordPairsRDD
  .groupByKey()
  .map(t => (t._1, t._2.sum))
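To make the shuffle difference concrete, here is a rough sketch using plain Scala collections (no Spark required) that counts how many records each approach would send across the shuffle. The partition contents and the helper names `shuffledWithGroup`, `shuffledWithReduce` and `merge` are invented for illustration; Spark does the equivalent work internally, and this is only a simplified model of it.

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Hypothetical data: two partitions of (word, 1) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq(("a", 1), ("a", 1), ("b", 1), ("a", 1)),
    Seq(("b", 1), ("b", 1), ("a", 1))
  )

  // groupByKey-style: every pair leaves its partition unchanged.
  def shuffledWithGroup(parts: Seq[Seq[Pair]]): Seq[Pair] =
    parts.flatten

  // reduceByKey-style: sum per key inside each partition first,
  // so at most one pair per key leaves each partition.
  def shuffledWithReduce(parts: Seq[Seq[Pair]]): Seq[Pair] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }.toSeq
    }

  // Final merge after the shuffle, identical for both approaches.
  def merge(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val grouped = shuffledWithGroup(partitions)
    val reduced = shuffledWithReduce(partitions)
    println(s"records shuffled, groupByKey-style: ${grouped.size}")
    println(s"records shuffled, reduceByKey-style: ${reduced.size}")
    println(s"same final counts: ${merge(grouped) == merge(reduced)}")
  }
}
```

With this toy input the grouped path moves 7 records and the reduced path moves 4, while both end with the same counts. On a real dataset with many repeated keys per partition the gap is far larger.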
Offset management is one of the most important activities in Kafka.
An offset is the value that lets a consumer resume reading from where it last stopped within a partition of a topic.
Last committed offset: While a consumer reads from a partition, it has to know what it has and has not already read. The last committed offset records this: it confirms the last message that has already been consumed. Because a topic can have multiple partitions, a consumer can hold multiple last committed offsets at any given time, one for each partition it reads.
Current position: As the consumer reads new records, it also tracks its current position. There is an important difference between the last committed offset and the current position: the records between the two have been read but not yet committed, so you can call them 'uncommitted' offsets.
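The relationship between the two values can be modelled in a few lines of plain Scala. This is only a toy model, not the Kafka client API: the `Progress` class and its methods are hypothetical names, and the offsets 100 and 25 are arbitrary.

```scala
object OffsetSketch {
  // Hypothetical per-partition state for one consumer.
  final case class Progress(committed: Long, position: Long) {
    // Records read since the last commit.
    def uncommitted: Long = position - committed
    // Reading n records advances only the position.
    def read(n: Long): Progress = copy(position = position + n)
    // Committing makes the current position the new committed offset.
    def commit: Progress = copy(committed = position)
    // After a crash or restart, reading resumes from the committed offset.
    def restart: Progress = copy(position = committed)
  }

  def main(args: Array[String]): Unit = {
    val afterReads = Progress(committed = 100L, position = 100L).read(25L)
    println(s"uncommitted after reading: ${afterReads.uncommitted}")
    println(s"resume position after a crash: ${afterReads.restart.position}")
    println(s"uncommitted after commit: ${afterReads.commit.uncommitted}")
  }
}
```

The `restart` case is why the gap matters: anything read but not committed is read again after a failure.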
It is up to the application developer to decide how offsets are managed. Setting enable.auto.commit = true makes the Kafka consumer responsible for deciding when the current position is committed and becomes the last committed offset. This is the default setting, and with it the application does not control exactly when a commit happens.
The only thing you can tune is the commit interval, for example auto.commit.interval.ms = 1000 (the default is 5000 ms).
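As a sketch, the auto-commit settings could be assembled like this before being passed to a consumer. The object and function names, the broker address `localhost:9092` and the group id `demo-group` are placeholders assumed for the example, not values from the post.

```scala
import java.util.Properties

object AutoCommitConfig {
  // Builds consumer settings that rely on automatic offset commits.
  def autoCommitProps(bootstrapServers: String, groupId: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", bootstrapServers)
    props.setProperty("group.id", groupId)
    // Let the consumer commit offsets on a timer.
    props.setProperty("enable.auto.commit", "true")
    // Commit roughly once per second instead of the 5000 ms default.
    props.setProperty("auto.commit.interval.ms", "1000")
    props
  }

  def main(args: Array[String]): Unit = {
    // Placeholder connection values for illustration only.
    val props = autoCommitProps("localhost:9092", "demo-group")
    println(props.getProperty("auto.commit.interval.ms"))
  }
}
```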
This setting has a drawback. Auto-commit runs on a timer and is not tied to whether your processing succeeded. If processing takes longer than the commit interval, or fails partway through, Kafka may commit offsets for messages that were never fully processed. To avoid this, Kafka lets the application manage consumer offsets itself. First, set enable.auto.commit = false.
Then commit offsets manually with one of two methods: commitSync, which blocks until the commit succeeds or fails, or commitAsync, which sends the commit request and returns immediately.
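A minimal sketch of a manual-commit loop follows, assuming the `kafka-clients` library is on the classpath and a broker is reachable. The object name, the `run` signature, the `keepRunning` and `process` callbacks and the 500 ms poll timeout are choices made for this example; only `poll`, `subscribe`, `commitAsync`, `commitSync` and `close` come from the Kafka consumer API.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}

object ManualCommitSketch {
  // Consumer settings with automatic commits switched off.
  def manualCommitProps(bootstrapServers: String, groupId: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", bootstrapServers)
    props.setProperty("group.id", groupId)
    props.setProperty("enable.auto.commit", "false")
    props.setProperty("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props
  }

  // Polls `topic` until keepRunning() returns false, committing only
  // after every record of a batch has been handed to `process`.
  def run(props: Properties,
          topic: String,
          keepRunning: () => Boolean,
          process: ConsumerRecord[String, String] => Unit): Unit = {
    val consumer = new KafkaConsumer[String, String](props)
    try {
      consumer.subscribe(Collections.singletonList(topic))
      while (keepRunning()) {
        val records = consumer.poll(Duration.ofMillis(500))
        val it = records.iterator()
        while (it.hasNext) process(it.next())
        // Batch fully processed: commit without blocking the loop.
        consumer.commitAsync()
      }
      // Clean shutdown: block until the final offsets are committed.
      consumer.commitSync()
    } finally {
      // If process() threw, nothing from the failing batch was committed.
      consumer.close()
    }
  }
}
```

Using commitAsync inside the loop keeps throughput up, while the single commitSync on shutdown makes sure the last offsets are actually stored. If `process` throws, the exception skips both commits, so the failed batch is read again on restart.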