
Real-Time Data Ingestion for Mission-Critical Applications


Our use case:

Data arrives in text files from some mission-critical applications. This is not clickstream data or anything similar.

We have to capture every row without losing any data.

Initially, we expect approximately 15,000,000 rows per day.

30,000 rows/minute.

Somehow we have to use Kafka to store the data.

Some consumers take the data from Kafka topics and then write it to HBase or Phoenix. This part is clear to us.

The most important thing is that every row in these text files must be read, no matter what.

Question 1.

Which solution is the best practice for this?

1. Flume & Kafka?

2. Spark Streaming & Kafka?

3. Only Spark Streaming?

4. Storm & Kafka?

5. Flume --> to HBase or Phoenix?

6. Any other solutions?

Question 2.

Can we implement the best-practice solution with NiFi?

Thanks in advance,

1 ACCEPTED SOLUTION

@Faruk Berksoz

Kafka - YES for all scenarios. Kafka is not for storing. Kafka is for transport. Your data still needs to land somewhere. As you mentioned, that is HBase via Phoenix, but it could also be HDFS or Hive.

1. Yes. Flume is OK for ingest, but you still need something else to post to Kafka (a Kafka producer), e.g. Kafka Connect.

2. No. Spark Streaming is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.

3. No. Same response as for #2.

4. No. Storm is appropriate for consumer applications, not really for your use case, which is about ingesting and posting to Kafka.

5. Could work, not recommended.
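Whichever ingest tool you pick, the "every row must be read" requirement comes down to the reader remembering how far it got and only advancing that position after a row has been handed off successfully. Here is a minimal sketch of that idea in plain Python, not how Flume or NiFi implement it. The names `ship_new_rows`, `load_offset`, `save_offset` and the `send` callback are hypothetical; `send` stands in for whatever posts a row to Kafka and is assumed to raise an exception on failure.

```python
import json
import os
from typing import Callable


def load_offset(checkpoint_path: str) -> int:
    """Return the last committed byte offset, or 0 if no checkpoint exists."""
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            return int(json.load(f)["offset"])
    except FileNotFoundError:
        return 0


def save_offset(checkpoint_path: str, offset: int) -> None:
    """Persist the offset atomically: write a temp file, then rename it."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)


def ship_new_rows(
    data_path: str,
    checkpoint_path: str,
    send: Callable[[int, bytes], None],
) -> int:
    """Send every complete row after the checkpoint; return how many were sent.

    The offset is committed only after send() returns, so a crash or a failed
    send causes the row to be re-sent later (at-least-once), never skipped.
    """
    offset = load_offset(checkpoint_path)
    shipped = 0
    with open(data_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a row that is still being written
            send(offset, line.rstrip(b"\r\n"))  # assumed to raise on failure
            offset = f.tell()
            save_offset(checkpoint_path, offset)
            shipped += 1
    return shipped
```

Committing after every row keeps the sketch simple; at your volumes you would probably commit once per batch instead. The trade-off is that a restart may re-send the rows of the last uncommitted batch, so the consumer side has to tolerate duplicates.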

The most common architectures are:

a) Flume -> Kafka Connect -> Kafka; consumer applications are built using either Storm or Spark Streaming. Other options are available, but less used.

b) NiFi -> Kafka -> Storm; consumer applications are built using Storm; this is the Hortonworks DataFlow (HDF) stack

c) Others (Attunity, Syncsort) -> Kafka -> consumer applications built in Storm or Spark Streaming

Since I am biased, I would say go with b), using Storm or Spark Streaming, or both, for the consumer applications. I'm saying that not only because I am biased, but because each of the components scales amazingly, and because I used Flume before and don't want to go back there now that I've seen what I can achieve with NiFi. Additionally, HDF will evolve into an integrated platform for stream analytics, with visual definition of flows and analytics requiring the least programming. You will be amazed by the functionality provided out of the box and via visual definition, and that is only months away. Flume is less and less used. NiFi does what Flume does and much more. With NiFi, writing the producers to Kafka is trivial. Think beyond your current use case. What other use cases can this enable?

One more thing. For landing data in HBase you can still use NiFi and its Phoenix connector to HBase. That is another scalable approach.
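One practical note on the landing side: a pipeline that never loses a row will usually deliver some rows more than once after a retry or restart. Writes to HBase overwrite by row key (Phoenix expresses this as `UPSERT`), so if the key is derived from where the row came from, a replayed row simply lands on top of itself. The sketch below illustrates that with a plain dictionary standing in for the table. The key scheme (source file name plus byte offset) and the names `row_key` and `upsert` are assumptions for illustration, not anything HBase, Phoenix or NiFi prescribes.

```python
from typing import Dict


def row_key(source_file: str, byte_offset: int) -> str:
    """Build a deterministic key from the row's origin (hypothetical scheme).

    The offset is zero-padded so keys for one file sort in file order.
    """
    return f"{source_file}:{byte_offset:020d}"


def upsert(table: Dict[str, bytes], source_file: str, byte_offset: int, row: bytes) -> None:
    """Insert or overwrite; the dict stands in for an HBase/Phoenix table."""
    table[row_key(source_file, byte_offset)] = row


if __name__ == "__main__":
    table: Dict[str, bytes] = {}
    upsert(table, "app1.log", 0, b"first row")
    upsert(table, "app1.log", 0, b"first row")  # replayed after a retry
    upsert(table, "app1.log", 10, b"second row")
    print(len(table))  # two distinct rows despite three writes
```

With a key like this, the consumer does not need to track which messages it has already written; replaying a Kafka partition from an older offset leaves the table unchanged.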
