What is the difference between Flume and Kafka?

How do Flume and Kafka compare?


Both Flume and Kafka are used for real-time event processing, but they differ in the following ways:

1. Kafka is a general-purpose publish-subscribe messaging system. It is not designed specifically for Hadoop; the Hadoop ecosystem is just one of its possible consumers. Flume, on the other hand, is part of the Hadoop ecosystem and is used for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store such as HDFS or HBase. It is more tightly integrated with the Hadoop ecosystem; for example, the Flume HDFS sink integrates well with HDFS security. Its common use case is therefore as a data pipeline for ingesting data into Hadoop.

2. It is easy to increase the number of consumers in Kafka without affecting performance and without downtime. Kafka does not track which messages in a topic have been delivered to which consumer; each consumer is responsible for tracking its own position through an offset. This makes Kafka very scalable. In Flume, by contrast, adding more consumers means changing the topology of the pipeline, which requires some downtime.

3. Kafka follows a pull model: different consumers can pull data from their respective topics at the same time, so each can process its data in real time or in batch mode. Flume, by contrast, follows a push model, so data may be lost if a consumer does not recover quickly.

4. Kafka supports both synchronous and asynchronous replication, depending on your durability requirements, and it runs on commodity hard drives. Flume supports both an ephemeral memory-based channel and a durable file-based channel. Even with the durable file-based channel, any event stored in a channel but not yet written to a sink is unavailable until the agent is recovered. Moreover, the file-based channel does not replicate event data to a different node; its durability depends entirely on the storage it writes to. That is why events in the channel can be lost if a Flume agent fails.

5. With Kafka we need to write our own producers and consumers, whereas Flume provides built-in sources and sinks that can be used out of the box.

6. Kafka does not provide native support for message processing, so it always needs to be integrated with another event processing framework. In contrast, Flume supports different data flow models and interceptor chaining, which makes event filtering and transformation easy. For example, you can filter out messages you are not interested in early in the pipeline, before sending them over the network, for obvious performance reasons. However, Flume is not suitable for complex event processing.
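
To make the interceptor chaining in point 6 more concrete, here is a rough Python sketch of the idea. It is not Flume's actual Java API: the names `drop_debug`, `add_host`, and `run_chain`, the event fields, and the host name `web-01` are all made up for illustration. It only shows how a chain of small functions can drop or modify events before they leave the agent.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical event shape: a flat dict of strings (not Flume's Event class).
Event = Dict[str, str]
# An interceptor returns the (possibly modified) event, or None to drop it.
Interceptor = Callable[[Event], Optional[Event]]


def drop_debug(event: Event) -> Optional[Event]:
    """Filter step: discard events we are not interested in."""
    return None if event.get("level") == "DEBUG" else event


def add_host(event: Event) -> Optional[Event]:
    """Transform step: tag the event with an assumed host name."""
    return {**event, "host": "web-01"}


def run_chain(events: List[Event], interceptors: List[Interceptor]) -> List[Event]:
    """Pass each event through the interceptors in order, stopping at a drop."""
    kept = []
    for event in events:
        current: Optional[Event] = event
        for intercept in interceptors:
            current = intercept(current)
            if current is None:
                break  # dropped early, so later steps and the network never see it
        if current is not None:
            kept.append(current)
    return kept


events = [
    {"level": "INFO", "body": "started"},
    {"level": "DEBUG", "body": "heartbeat"},
    {"level": "ERROR", "body": "disk full"},
]
result = run_chain(events, [drop_debug, add_host])
```

Putting the filter first in the chain means dropped events never reach the later steps, which is the performance point made above.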
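
The pull model and consumer-side offset tracking from points 2 and 3 can also be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API: `TopicLog` and `PullConsumer` are hypothetical names, and a real topic is partitioned and stored on disk. The sketch only illustrates why adding a consumer does not require any change on the log side.

```python
class TopicLog:
    """Append-only list standing in for one topic partition (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to the new record

    def read(self, offset, max_records=10):
        # The log holds no per-consumer state; the caller supplies the offset.
        return self._records[offset:offset + max_records]


class PullConsumer:
    """Consumer that pulls at its own pace and remembers its own position."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.offset = 0  # owned by the consumer, not by the log

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only past what was actually read
        return batch


log = TopicLog()
for i in range(5):
    log.append(f"event-{i}")

realtime = PullConsumer("realtime", log)
batch = PullConsumer("batch", log)

first = realtime.poll(max_records=2)       # small, frequent pulls
everything = batch.poll(max_records=100)   # one large pull of the same data

# A consumer added later starts from its own offset; the log is untouched.
late = PullConsumer("late", log)
```

Because each consumer carries its own offset, the real-time and batch consumers read the same records independently, and the late consumer can still read everything from the beginning.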