Looking for some advice/guidance on designing an architecture solution for storing data in HBase.
Our current flow is: NiFi -> Kafka -> Storm -> HBase.
This is working as expected, but as we receive more requirements, we need to be more flexible. Our HBase store is now going to hold a lot more information from different parts of our company, requiring more HBase tables as new requirements arrive. I was looking into designing a generic Storm topology that would take the table name and other data from Kafka at run time, allowing us to dynamically pass in any data/table/column family. The topology's main responsibility would then simply be to parse the input and write to the table name it received as part of the Tuple. However, I believe this is not advised, as the HBase Bolt requires the table name to be passed in up front (it is used in the prepare() method), which rules out the flexible solution I am after.

Does anyone have other tools/ideas for this? Currently we would have one HBase topology and, any time we added a new table to HBase, we would update that topology with a new HBase Bolt. This is not the end of the world, and probably what we will go with if we don't find another way, but I'm curious what else is out there.
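To make the idea concrete: the design boils down to each record carrying its own destination table name, with the writer routing on that field. Here is a minimal, dependency-free sketch of that routing, with no Storm or HBase classes; `TableRoutingWriter` and the field names (`table`, `rowKey`, `value`) are made up for illustration, and the per-table `List` stands in for a per-table HBase client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Dependency-free sketch of the "generic topology" idea: each incoming
 * record names its destination table, and the writer routes on it.
 * The per-table List stands in for a per-table HBase client/table handle.
 */
class TableRoutingWriter {
    // One buffer per table name, lazily created on first write.
    private final Map<String, List<String>> perTable = new HashMap<>();

    void write(Map<String, String> record) {
        String table = record.get("table");   // table name travels with the data
        String rowKey = record.get("rowKey");
        String value = record.get("value");
        perTable.computeIfAbsent(table, t -> new ArrayList<>())
                .add(rowKey + "=" + value);
    }

    List<String> rowsFor(String table) {
        return perTable.getOrDefault(table, List.of());
    }
}
```

In a real bolt, `write` would be the body of `execute(Tuple)`, and the buffer lookup would instead fetch or create an HBase client for that table.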
Some requirements we are hoping to achieve:
1. A single point of entry for writes to HBase. This means only one component needs maintenance/updating when versions change, and it provides other benefits (easier authorization of writes, auditing, etc.)
2. Separating data into 2 streams:
a. Raw data that simply needs to be archived in HBase, with no processing required
b. Data that needs to go through some form of processing first. We will be using Spark for much of this. The processed data would then be stored in HBase by the same archival solution.
I have looked into using NiFi, but I would prefer to keep it purely as our data ingestion/routing/model-transformation tool, and leave writing to HBase to a separate component. NiFi could become unmanageable as we add more and more tables and process groups. Spark might do it, but it seems like overkill.
Any other guidance?
The current implementation of HBaseBolt requires the table name in order to handle the delegation token, so we may not want to modify storm-hbase itself to make the table flexible.
If you don't use security, it should be straightforward to implement your own bolt based on the current HBaseBolt. Both HBaseBolt and AbstractHBaseBolt require a table name, and the mapper doesn't handle table names, so you could copy the code and modify it a bit: remove the table-name requirement, let the mapper supply the table name as well, and initialize an HBaseClient instance per table name.
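The interface change being suggested might look like the sketch below: the mapper derives not only the row key and columns from a tuple (as storm-hbase's mapper does for a fixed table) but also the destination table name. The interface and method names here are invented for illustration, not the storm-hbase API, and the Tuple is simplified to a `Map`.

```java
import java.util.Map;

/**
 * Sketch of a mapper that also supplies the table name per tuple,
 * instead of the bolt being constructed with a fixed table.
 * Not the storm-hbase API; names are illustrative.
 */
interface TableAwareHBaseMapper {
    String tableName(Map<String, Object> tuple);       // new: per-tuple table
    byte[] rowKey(Map<String, Object> tuple);          // analogous to HBaseMapper
    Map<String, byte[]> columns(Map<String, Object> tuple);

    /** Example mapper that trusts the tuple to carry its own destination. */
    static TableAwareHBaseMapper fromTupleFields() {
        return new TableAwareHBaseMapper() {
            public String tableName(Map<String, Object> tuple) {
                return (String) tuple.get("table");
            }
            public byte[] rowKey(Map<String, Object> tuple) {
                return ((String) tuple.get("rowKey")).getBytes();
            }
            public Map<String, byte[]> columns(Map<String, Object> tuple) {
                // Hypothetical single "d:value" column for illustration.
                return Map.of("d:value", ((String) tuple.get("value")).getBytes());
            }
        };
    }
}
```

The custom bolt's `execute()` would then call `mapper.tableName(tuple)` first and look up (or lazily create) the HBaseClient for that table before building the mutation.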
If you plan to deal with lots of tables, you may also want to ensure that only a couple of HBaseClient instances are retained at once. That may actually be the signal that you'd be better off working with the HBase API directly and implementing a custom bolt.
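One common way to keep only a few clients alive is an LRU cache built on `LinkedHashMap`'s access-order eviction. This is a generic sketch, not storm-hbase code; "client" here is a stand-in for an HBase connection/table handle, and a real version would close the evicted client's connection inside `removeEldestEntry` before letting it go.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Bounded LRU cache of per-table clients. When the bound is exceeded,
 * the least-recently-used entry is evicted (a real implementation
 * would also close the evicted HBase client here).
 */
class ClientCache<C> extends LinkedHashMap<String, C> {
    private final int maxClients;

    ClientCache(int maxClients) {
        super(16, 0.75f, true);   // accessOrder = true => LRU ordering
        this.maxClients = maxClients;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, C> eldest) {
        // Called by put(); returning true drops the least-recently-used client.
        return size() > maxClients;
    }
}
```

With `maxClients` set to, say, 2, writing to a third table evicts whichever client was touched least recently, so memory stays bounded no matter how many tables the topology sees.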
I do need security; we have Kerberos implemented for HBase. Are you saying to use the HBase REST API if we need a lot of tables?
I didn't mean you need to use the REST API. Just as storm-hbase deals with Kerberos authentication, a custom implementation can deal with it too. I just meant that this may not be a good thing to generalize in storm-hbase itself, so you may need to implement a custom bolt by copying and modifying the code.