08-09-2017 08:28 AM
Hi Team, we have a requirement to process up to 1 million messages per minute. We need to ingest these messages into HDFS in Parquet format and expose them via Impala tables. We have considered the options below -
1. Use the Flafka model with Kafka sink - Here, the Flume source will consume messages from a queue, write them to a memory channel and then to a Kafka sink. Spark Streaming will then consume messages from Kafka directly. The problem with this approach is that we may lose messages if the Flume agent crashes while messages are still in the memory channel.
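For reference, option 1 would look roughly like the Flume agent config below. This is a minimal sketch: the agent name, broker/queue hosts, topic name, and channel capacities are all illustrative placeholders, and an ActiveMQ-style JMS provider is assumed.

```properties
# Option 1 sketch: JMS source -> memory channel -> Kafka sink
# (memory channel contents are lost if this agent crashes)
agent1.sources = jmsSrc
agent1.channels = memCh
agent1.sinks = kafkaSink

agent1.sources.jmsSrc.type = jms
agent1.sources.jmsSrc.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent1.sources.jmsSrc.providerURL = tcp://broker-host:61616
agent1.sources.jmsSrc.destinationName = inbound.queue
agent1.sources.jmsSrc.destinationType = QUEUE
agent1.sources.jmsSrc.channels = memCh

agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 100000
agent1.channels.memCh.transactionCapacity = 10000

agent1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafkaSink.kafka.bootstrap.servers = kafka-host:9092
agent1.sinks.kafkaSink.kafka.topic = ingest-topic
agent1.sinks.kafkaSink.channel = memCh
```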
2. Use the Flafka model with Kafka as channel - Here, we are using a JMS source, a Kafka channel (instead of a memory or file channel) and a Spark sink to achieve high throughput. The concern we have with this architecture is how we would replay messages from Kafka in case of sink failures, as the plumbing is done by Flume and we don't have control over Kafka. Moreover, does Flume utilize the Kafka partitions and provide the same parallelism that we would get using Kafka separately?
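Option 2 would be configured roughly as below (again a sketch with placeholder hosts/topic/group names). One thing to be aware of, if I understand the Kafka channel correctly: Flume stores events in the channel topic serialized as Flume events, so a consumer reading that topic directly (e.g. Spark) has to account for that, and the channel's `parseAsFlumeEvent` setting only governs how non-Flume producers' data is interpreted.

```properties
# Option 2 sketch: JMS source -> Kafka channel (durable, replayable)
# Spark can consume the channel's backing topic directly.
agent1.sources = jmsSrc
agent1.channels = kafkaCh

agent1.sources.jmsSrc.type = jms
agent1.sources.jmsSrc.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent1.sources.jmsSrc.providerURL = tcp://broker-host:61616
agent1.sources.jmsSrc.destinationName = inbound.queue
agent1.sources.jmsSrc.destinationType = QUEUE
agent1.sources.jmsSrc.channels = kafkaCh

agent1.channels.kafkaCh.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kafkaCh.kafka.bootstrap.servers = kafka-host:9092
agent1.channels.kafkaCh.kafka.topic = flume-channel-topic
agent1.channels.kafkaCh.kafka.consumer.group.id = flume-channel-group
```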
3. Use a Java client to write to Kafka - Here, we will write our own Java client to produce to Kafka and run Spark Streaming directly against Kafka. This will add more development time and a new component that we will have to maintain.
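To give a sense of what option 3 involves, below is a sketch of the producer configuration such a client might start from. All values (broker host, batch/linger settings, serializers) are illustrative assumptions, not benchmarked numbers; the actual `KafkaProducer` construction needs the kafka-clients jar on the classpath, so it is shown only as a comment.

```java
import java.util.Properties;

// Sketch of producer tuning for a high-throughput custom Kafka client.
// All values are illustrative placeholders, not benchmarked settings.
public class ProducerConfigSketch {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "all");            // favor durability over latency
        props.put("linger.ms", "5");         // small batching delay for throughput
        props.put("batch.size", "131072");   // 128 KB batches
        props.put("compression.type", "snappy");
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps();
        // With kafka-clients on the classpath you would then create:
        // KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
        System.out.println("acks=" + props.getProperty("acks"));
    }
}
```

Keying messages appropriately also matters here, since the key determines the partition and therefore the consumer-side parallelism Spark can get.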
Can you please share your thoughts and let me know whether my understanding is correct?