
Kafka-HBase integration


New Contributor

Hi,

Just a question...

Is it possible to store stream data obtained from Kafka into HBase directly, without using any processing logic in Spark/Storm? We don't need to apply any logic to the data obtained from Kafka; we just need to store it in HBase.

5 REPLIES

Re: Kafka-HBase integration

You will need some kind of "processing". You need some kind of Kafka consumer to read the data and write it into HBase. Both of them are data stores, so they don't do the import/export themselves. It doesn't have to be Spark or Storm; it could also be a MapReduce job, or simply a Java consumer with an HBase output.

For MapReduce:

There is something called a KafkaInputFormat that you could use to read the data and write it out through an HBaseOutputFormat.

http://www.conductor.com/nightlight/data-stream-processing-bulk-kafka-hadoop/
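
To illustrate the "simple Java consumer with an HBase output" option, here is a minimal sketch using the Kafka consumer API and the HBase 1.x client. The broker address, topic, table name, column family, and offset-based row key are all hypothetical placeholders, not part of anything above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToHBase {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumption: your broker list
        props.put("group.id", "hbase-writer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
        try (Connection hbase = ConnectionFactory.createConnection(conf);
             Table table = hbase.getTable(TableName.valueOf("stream_data")); // hypothetical table
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {

            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Row key here is just topic-partition-offset; pick something meaningful.
                    Put put = new Put(Bytes.toBytes(r.topic() + "-" + r.partition() + "-" + r.offset()));
                    // Column family "d" must already exist in the hypothetical table.
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"),
                            Bytes.toBytes(r.value()));
                    table.put(put);
                }
            }
        }
    }
}
```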

Re: Kafka-HBase integration

Mentor
@Rohan kapadekar

Absolutely, you can do that. Write consumer code and wrap your HBase inserts into it. Here's an example I wrote that you can use: https://github.com/dbist/KafkaHBaseBenchmark

This is based on the HBase 1.x API but not on the new Kafka consumer API; it should still work. It doesn't use HBase bulk writes, as the goal was to test speed. Keep in mind that you need to handle reading from each Kafka partition yourself, which a Storm bolt would have taken care of for you. You also need to make sure every message is inserted, which Storm also took care of.
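
For reference, here is a minimal sketch of handling one partition yourself with the newer consumer API's assign() call. The broker, topic, and the one-process-per-partition scheme are assumptions for illustration, not part of the linked benchmark code:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PartitionReader {
    public static void main(String[] args) {
        int partitionId = Integer.parseInt(args[0]); // e.g. launch one process per partition
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() pins this consumer to exactly one partition; no group rebalancing.
            consumer.assign(Collections.singletonList(new TopicPartition("my-topic", partitionId)));
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.printf("p%d offset %d: %s%n",
                                r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```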

Re: Kafka-HBase integration

So this is a standalone Kafka consumer that then writes the tuples into HBase? Cool. Out of interest, what kind of throughput did you see?

Re: Kafka-HBase integration

Mentor

@Benjamin Leonhardi don't forget to tag my name; we get so many emails. This is more code than necessary, as I was reusing code from a Storm topology. I was getting 15k/sec with every new consumer, so running many instances of this class will improve performance. That's what I meant about reading from Kafka: make sure you read from every partition. In essence, if you have 8 partitions, you need to find a way to run 8 instances of this class. I went up to 45k/sec on a 4-node HBase and 3-node Kafka cluster; I believe they were co-located, and this was in Azure.
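
One hedged way to get those 8 instances is to run one consumer thread per partition, all in the same consumer group, so Kafka divides the partitions among them. A minimal sketch (the group id, topic, and broker are hypothetical):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ParallelConsumers {
    public static void main(String[] args) {
        int partitions = 8; // match your topic's partition count
        for (int i = 0; i < partitions; i++) {
            // KafkaConsumer is not thread-safe, so each thread gets its own instance.
            new Thread(ParallelConsumers::run, "consumer-" + i).start();
        }
    }

    static void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "hbase-writers"); // same group => partitions are divided up
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> { /* write r.value() into HBase here */ });
            }
        }
    }
}
```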

Re: Kafka-HBase integration

Expert Contributor

Kafka is a pub-sub messaging queue, which means you need to design a publisher (producer, in Kafka terms) and a subscriber (consumer, in Kafka terms). A simple API can be used to create each of them in Java or Scala. A REST consumer also exists.
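
To make the publisher side concrete, here is a minimal sketch of a Java producer; the broker address, topic, and record contents are hypothetical:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimplePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous; close() flushes any outstanding records.
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello"));
        }
    }
}
```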

When it comes to ingesting data into Hadoop (HDFS, HBase, ...), Gobblin (the successor to Camus) is the common answer. HBase can simply be used as an OutputFormat in such a scenario. Try the HDFS ingestion example here: http://gobblin.readthedocs.org/en/latest/case-studies/Kafka-HDFS-Ingestion/
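
For reference, that tutorial drives Gobblin with a job configuration file; a minimal sketch along the lines of the linked guide (key names as in that guide; the job name, topic, and broker are hypothetical) might look like:

```properties
job.name=KafkaIngestQuickStart
job.group=GobblinKafka
job.description=Pull data from Kafka into HDFS

# Kafka source; topic.whitelist is optional and shown as an assumption
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka
kafka.brokers=broker1:9092
topic.whitelist=my-topic
bootstrap.with.offset=earliest

# Plain text writer into HDFS, one file path per topic
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher
```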