what are best practices for "importing" streamed data from Kafka into HBase?
The usecase is as follows: Vehicle sensor data are streamed to Kafka. Afterwards, these sensordata must be transformed (i.e., deserialized from protobuf in humanreadable data) and stored in HBase.
1) Which toolset do you recommend (e.g., Kafka --> Flume --> HBase, Kafka --> Storm --> HBase, Kafka --> Spark Streaming --> HBase, Kafka --> HBase)
2) What is the best place for doing the protobuf deseralization (e.g., within Flume using interceptors)?
Thank you for your support.
Before I answer can I ask a few questions?
1) What is the volume in terms of messages per second?
2) How big is an individual message
3) Are there other transformation or processing that happens other than the protobuf deserialization?
4) Are all messages the same or are there multiple message types and /or topics?
5) Are you running your Hbase cluster with Kerberos enabled?
sure, see details below - please be aware that these are only rough estimations.
1) about 150 K per second
2) about 250 bytes (however the overall volume is expected to be only about 300 GB a day as not each vehicle is sending 24 hours a day)
3) maybe a transformation of an incoming data format to the "local" data format - however, maybe in the future this transformation is already done before data arrive in Kafka
4) there are about 2-5 topics all storing the same vehicle sensor information, however, in a different format (see also 3).
5) currently Kerberos is disabled, this may change in future as it stongly depends on the overall system architecture
Thank you very much for your support.
Exactly once would be the best case, however at least once would be fine as well (cleansing mechanisms could be done done afterwards)
Ok Thanks Thomas for the information, and sorry for the delayed response.
You are putting about 150k messages/second into Kafka, 250byte messages.
You would like to do some light processing/formatting, but not a complex topology.
There are a number of considerations for this architecture. On the HBase side, you'll need anywhere from 10-15 or so Region Servers at minimum. I've definitely seen a single Region Server do 30k writes/second, but with mixed reads as well this number can be lower. It's a safe bet to assume that a typical Region Server on bare-metal can do at least 10k sustained writes per second. Particulalry if you pre-split regions and have a good distribution of rowkeys across regions.
So make sure you have a good rowkey design that can be well-distributed, you can also kind of bucketthe rowkeys based on projected volume and # of region servers.
Regarding delivery semantics. There are all sorts of reasons why you can get duplicates in a Kafka-based pipeline, but using HBase actually mitigates this, assuming you have a unique ID per message, and that ID is part of your rowkey. Otherwise, as you mentioned you'll have to de-dup at some point.
In order to keep up with 150k messages per second, you'll need to do some testing to see how many messages you can read and then write to HBase per partition.
I'm not a huge fan of Storm, for a number of reasons, not the least of which is deploying and managing another distributed system for what amounts to a simple transformation (protobuf deserialization). Flume might be an ok choice, you could certainly code your transformation in the interceptor, but you'll likely need the same number of agents as region servers in order to process the the data effectively. Note the AsyncHBaseSink doesnt's support kerberos, so your mileage may vary. (Note I'm a pretty big fan of the Flume-Kafka integration in general, but it may not be appropriate for all use cases)
You could also look at the new Spark Streaming / Kafka integration availabe in Spark 1.3. This is nice because, as that doc states,
Simplified Parallelism: No need to create multiple input Kafka streams and union-ing them. With directStream, Spark Streaming will create as many RDD partitions as there is Kafka partitions to consume, which will all read data from Kafka in parallel. So there is one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
The other option is just to write a java application. A relatively simple standalone java process can do this with no problems, though you'd certainly need multiple threads and multiple processes from multiple machines. Still 150k/second should be achievable on 2-4 nodes if done correctly. You should probably read this if you roll your own.
In any of the cases, make sure you up your handler threads in HBase and tune HBase for writes. Also make sure that you have enough Kafka partitions to properly paralleize your consumption.
Let me know if you have other questions.
Hi Jeff, many thanks to your response.
Indeed, I have one more question, and I think, its a fundamental one. I'm asking myself whether my suggested solution architecture fits at all.
Maybe its better to introduce a separate stage for data storage after ingesting and computing required views afterwards (cf. batch- and serving layer of a typical lambda architecture), e.g. Kafka --> Flume --> HDFS -->Spark --> HBase
What do you think?
Currently, its not clear how the data will be accessed/queried later on by data scientists (we assume they will use tools like R, Spark machine learning or even pose some kind of SQL like queries). Also, we do not need random writes, i.e., only new data will be added, there are no updates. Therefore, I'm not sure whether we need HBase at all.
What is your opinion?