Created on 04-27-2018 04:39 PM - edited 08-17-2019 07:42 AM
ETL With Lookups with Apache HBase and Apache NiFi
(Microservices Style ETL)
When we are ingesting tabular / record-oriented data, we often want to enrich the data by replacing ids with descriptions or vice versa. There are many transformations that may need to happen before the data is in a happy state. When you are denormalizing your data in Hadoop and building very wide tables, you often want descriptions or other data to enhance its usability. Getting everything you need in a single call is nice, especially when you have 100 trillion records.
I want to do my lookups against HBase, which is a great NoSQL store for lookup tables, and use them to generate datasets.
First I created an HBase Table to use for lookups.
Create HBase Table For Lookups
create 'lookup_', 'family'
Table With Data
Most people would have a pre-populated table for lookups. I don't, and since we are using a generator to build the lookup ids, I am building the lookup descriptions with a REST call at the same time. We could also have a flow that adds a lookup when it is not found, and another flow that ingests lookup values and adds or updates them when needed.
I found this cool API that returns a sentence of meat words. I use this as our description, because MEAT!
Call the Bacon API!!!
Let's turn our plain text into a clean JSON document
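As a rough illustration of this step outside NiFi, the sketch below wraps a plain-text sentence in a small JSON document. The field names id_product and prod_desc are the ones used by the lookup in this article; the function name to_lookup_document and the sample values are hypothetical, and the flow itself does this with NiFi processors rather than Python.

```python
import json


def to_lookup_document(product_id, sentence):
    """Wrap a plain-text sentence in a JSON document (hypothetical helper)."""
    # Collapse runs of whitespace and trim the ends so the description is clean.
    description = " ".join(sentence.split())
    # Store the id as a string, assuming it will be used as a row key later.
    return json.dumps({"id_product": str(product_id), "prod_desc": description})


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(to_lookup_document(42, "  Bacon ipsum dolor amet\n brisket.  "))
```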
Then I store it in HBase as my lookup table. You probably already have a lookup table. This is a demo and I am filling it with my generator. This is not a best practice or a good design pattern. This is a lazy way to populate a table.
This is the magic. We take in our records; in this case we are reading JSON records and writing JSON records, but we could choose CSV, Avro or others. We connect to the HBase Record Lookup Service. We replace the current prod_desc field in the record with what is returned by the lookup, using the id_product field as the lookup key. Nothing else is needed to change records in stream.
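Conceptually, the enrichment described above amounts to the following minimal sketch. It assumes a plain Python dict standing in for the HBase lookup table; the enrich function and the sample rows are hypothetical. In NiFi this is done through processor and controller service configuration, not code, and leaving unmatched records unchanged is a choice made for this sketch, not a statement about how NiFi routes them.

```python
def enrich(records, lookup_table, key_field="id_product", target_field="prod_desc"):
    """Replace target_field in each record with the value looked up by key_field."""
    enriched = []
    for record in records:
        updated = dict(record)  # copy so the incoming record is not mutated
        description = lookup_table.get(record.get(key_field))
        if description is not None:
            updated[target_field] = description
        enriched.append(updated)
    return enriched


if __name__ == "__main__":
    # A dict stands in for the HBase table: row key -> description.
    lookup = {"101": "Bacon ipsum dolor amet brisket."}
    incoming = [
        {"id_product": "101", "prod_desc": ""},
        {"id_product": "999", "prod_desc": "unknown"},
    ]
    for row in enrich(incoming, lookup):
        print(row)
```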
HBase Record Lookup Service
HBase Client Service Used by HBase Record Lookup Service
We can use UpdateRecord to clean up, transform or modify any field in the records in stream.
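To make the idea of a field-level update concrete, here is a small sketch of the same kind of cleanup in Python. It does not use or imitate the UpdateRecord configuration syntax; the update_records function, the cleanup rule, and the sample row are all hypothetical.

```python
def update_records(records, field_updates):
    """Apply a transform function to each named field that is present in a record."""
    updated = []
    for record in records:
        new_record = dict(record)  # copy so the incoming record is not mutated
        for field, transform in field_updates.items():
            if field in new_record:
                new_record[field] = transform(new_record[field])
        updated.append(new_record)
    return updated


if __name__ == "__main__":
    rows = [{"id_product": "101", "prod_desc": "  bacon ipsum  "}]
    # Trim whitespace and capitalize the first letter of the description.
    cleaned = update_records(rows, {"prod_desc": lambda v: v.strip().capitalize()})
    print(cleaned)
```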